Operations
SpatialHadoop ships with many spatial operations that work efficiently on a cluster and scales
up to terabytes of data. This page describes these operations with all possible parameters
and customizations. You can list down all supported operations by types bin/shadoop.
You can also get a quick description and arguments for a specific operation by typing
bin/shadoop <operation>
index
The index command is used to construct a spatial index for a given input file.
Currently, there are three spatial indexes which can be constructed in SpatialHadoop, namely,
Grid, R-tree and R+-tree.
A grid index partitions the data according to a uniform grid based on the input file size.
For uniform data, this index is expected to partition the input file,
into equally sized partition, each of size 64MB. If an input record (e.g., rectangle) overlaps
multiple partitions, it is replicated to all overlapping partitions.
R-tree and R+-tree are both used for skewed input data where a uniform grid does not work
properly. In R-tree, each input record is stored in only one partition. The boundaries
of each partition may need to expand to enclose all contained records which means that
partitions might overlap. On the other side, in R+-tree, partitions are kept disjoint while
records are replicated to all overlapping partitions.
bin/shadoop index <input> <output> shape:<input format>
sindex:<index> blocksize:<size> -overwrite
Arguments
- input, output - Path to input and output files
-
shape:input format - Type of shapes stored in the input file.
Check all available built-in datatypes. You can also provide
the full name of a built-in or a custom datatype.
- sindex:index - Type of index to construct, either grid, rtree or
r+tree.
- blocksize:size - Block size for generated file. By default, it uses the default
block size in the output directory.
- -overwrite - If provided, output file will be overwritten without notice. Otherwise,
the job will fail if output file already exists.
Range query
The rangequery command is used to run a spatial range query for an input file. If the
input file is indexed, the index will be accessed to speed up the execution of the query.
Otherwise, the whole file will be scanned to retrieve matching records.
bin/shadoop rangequery <input> <output> shape:<input format>
rect:<rectangle> -overwrite
- input, output - paths to input and output. If input file is indexed, the index
will be accessed to speed up the query execution. If output path is not provided, the output
will be discarded which is useful for benchmarking if the actual output is not important.
- shape:input format - Type of shapes stored in the input file.
Check all available built-in datatypes. You can also provide
the full name of a built-in or a custom datatype.
- rect:rectangle
- Dimensions of the query rectangle in the format x1,y1,x2,y2 where
(x1,y1) is the coordinate of the minimum corner of the query ractangle while (x2,y2) is the
coordinate of the maximum corner of it.
- -overwrite - If provided, output file will be overwritten without notice. Otherwise,
the job will fail if output file already exists.
k Nearest Neighbor
The knn operation is used to run a k nearest neighbor query for an input file. Given
a query point Q, this operation returns the k closest records to Q. If the
input file is indexed, the index will be accessed to speed up the execution of the query.
Otherwise, the whole file will be scanned to retrieve correct answer.
bin/shadoop knn <input> <output> shape:<input format>
point:<x,y> k:<k> -overwrite
- input, output - paths to input and output. If input file is indexed, the index
will be accessed to speed up the query execution. If output path is not provided, the output
will be discarded which is useful for benchmarking if the actual output is not important.
- shape:input format - Type of shapes stored in the input file.
Check all available built-in datatypes. You can also provide
the full name of a built-in or a custom datatype.
- point:x,y
- Coordinates of the query point.
- k:k
- Query parameter k, i.e., maximum number of records to return.
- -overwrite - If provided, output file will be overwritten without notice. Otherwise,
the job will fail if output file already exists.
Disributed Join
The dj operation performs a spatial join between two files. If both files are
indexed, the distributed join algorithm joins every pair of overlapping partitions which is
very scalable and efficient for large files. If at least one of the files is not indexed,
the algorithm reduces to a simple block nested loop join which run in a distributed
environment in SpatialHadoop.
If both files are indexed, the algorithm might automatically choose to repartition of the
files to match the partitions used by the other file which, in some cases, might increase
the performance of the operation by reducing total number of map tasks. This option can be
overridden by command parameters.
bin/shadoop dj <input1> <input2> <output> shape:<input format>
repartition:<yest|no|auto> -overwrite
- input, output - paths to input and output. If input file is indexed, the index
will be accessed to speed up the query execution. If output path is not provided, the output
will be discarded which is useful for benchmarking if the actual output is not important.
- shape:input format - Type of shapes stored in the input file.
Check all available built-in datatypes. You can also provide
the full name of a built-in or a custom datatype.
- repartition:yes|no|auto
- Whether to repartition one of the files to speedup the
execution or not. yes will always repartition the smaller file. no will
never repartition any of the files. auto which is the default, will automatically
decide whether to repartition or not based on a simple cost model.
- -overwrite - If provided, output file will be overwritten without notice. Otherwise,
the job will fail if output file already exists.
SJMR
SJMR is the MapReduce implementation of the
Partition Based Spatial Merge Join
by Jignesh Patel and David DeWitt. This operation is designed to perform spatial join efficiently
for non-indexed files. This algorithm partitions both files according to a uniform grid and
joins every pair of matching grid cells. The dimensions of the partitioning grid are
automatically determined based on input file sizes such that average partition size is equal
to HDFS block size (e.g., 64MB). If one or both of the input files are skewed, the performance
of this algorithm may deteriorate.
bin/shadoop sjmr <input1> <input2> <output> shape:<input format>
-overwrite
- input, output - paths to input and output. If input file is indexed, the index
will be accessed to speed up the query execution. If output path is not provided, the output
will be discarded which is useful for benchmarking if the actual output is not important.
- shape:input format - Type of shapes stored in the input file.
Check all available built-in datatypes. You can also provide
the full name of a built-in or a custom datatype.
- -overwrite - If provided, output file will be overwritten without notice. Otherwise,
the job will fail if output file already exists.
Minimal Bounding Rectangle
The mbr operation determines the minimal bounding rectangle (MBR) of an input file.
If the file is indexed, the MBR is retrieved directly from the index. Otherwise, the whole
file is scanned and the MBR is calculated as a simple aggregate function.
bin/shadoop mbr <input> shape:<input format>
- input - path to input file. No output file is provided as the result is directly
written to standard output. If the input is a directory, the output is cached by writing a
master file in the same directory.
- shape:input format - Type of shapes stored in the input file.
Check all available built-in datatypes. You can also provide
the full name of a built-in or a custom datatype.
Read file
Reads an indexed file and writes some simple information about the index.
bin/shadoop readfile <input>
- input - path to input file. No output file is provided as the result is directly
written to standard output.
Sample
Reads a random sample from an input file.
bin/shadoop sample <input> <output> shape:<input format>
outshape:<output format>
(count:<c> | size:<s> | ratio:<r>) seed:<sd>
- input, output - paths to input and output. If input file is indexed, the index
will be accessed to speed up the query execution. If output path is not provided, the output
will be written to standard output which is useful for debugging or when the output
is very small (a few records).
- shape:input format - Type of shapes stored in the input file.
Check all available built-in datatypes. You can also provide
the full name of a built-in or a custom datatype.
- outshape:output format - If the output is desired to be in a different format than
input, this parameter can be provided to do a simple transformation on sampled data. The two
possible options are rectangle and point. If the output is rectangle,
the minimal bounding rectangle (MBR) of input shape is returned. If the output is
point, the center point of the MBR of each sampled record is returned.
- count:c - read a random sample that contains c records
- size:s - read a random sample with a maximum of s bytes
- ratio:r - read a random sample with with number of records of the given ratio compared to input file
- seed:sd - use the given seed for randomization. This can be useful to replicate some
experiments
Spatial Generator
The operations generate is used to generate a file with spatial data of an arbitrary
size. This can be very useful for benchmarking and stress tests.
bin/shadoop generate <output> shape:<output format> mbr:<x1,y1,x2,y2>
blocksize:<B> sindex:<grid> seed:<sd> rectsize:<rs> -overwrite
- input - path to output.
- shape:output format - shapes to generate in output file. For now, the only shapes
that are supported are point and rectangle.
- mbr:x1,y1,x2,y2 - bounding rectangle of the generated file. All generated objects
must lie within this bounding box.
blocksize:B - block size for generated file. The default is the default block size
for output path.
sindex:grid - generate the output file already indexed using a grid index. Notice
that you can only specify a grid index. R-tree indexes can only be constructed after the file
is generated because it depends on data distribution.
seed:sd - seed to use when generating data. This is useful to replicate some
experiments using the same exact data.
rectsize:rs - maximum edge size for generated rectangles .The width and height of
each generated rectangle are picked uniformly random in the range (0, rs).
-overwrite - If provided, output file will be overwritten without notice. Otherwise,
the job will fail if output file already exists.
Polygon Union
Computes the union of a set of polygons in an input file. This operation only works for input
files of type JTSShape. It takes one input file and computes the union of all polygons contained
in this file.
bin/shadoop union <input> <output> -overwrite
- input, output - paths to input and output. If input file is indexed, the index
will be accessed to speed up the query execution.
- -overwrite - If provided, output file will be overwritten without notice. Otherwise,
the job will fail if output file already exists.
Plot
Constructs an image that shows all the data in the input file. The generated file is one PNG
image that contains a plot for all the data in the input file.
bin/shadoop plot <input> <output> shape:<input format>
width:<w> height:<h> -keep-ratio color:<c> -vflip -overwrite
- input, output - paths to input and output.
- shape:input format - Type of shapes stored in the input file.
Check all available built-in datatypes. You can also provide
the full name of a built-in or a custom datatype.
- width:w - maximum width of the generated image. Default is 1000
- height:h - maximum height of the generated image. Default is 1000
- -keep-ratio - directs SpatialHadoop to keep the aspect ratio when generating the image.
This means that the generated image will be of an aspect ratio similar to the MBR of the input file.
To avoid keeping aspect ratio, use -no-keep-ratio instead.
- color:c - the default color used to draw shapes. You can use one of the following
predefined colors: red, pink, blue, cyan, green, black, gray, and orange.
- -vflip - flips the generated image vertically.
- -overwrite - If provided, output file will be overwritten without notice. Otherwise,
the job will fail if output file already exists.
Plot Pyramid
Constructs an interactive image for the input file where the user can zoom in/out and pan
around the image. You can catch as many details as you want in the generated image by generating
more zoom levels. The output consists of a set of images, each of a fixed size (Tile Width x
Tile Height) generated at different zoom levels. The output also contains an HTML file which
uses Google Maps APIs
to navigate through generated images using a Google-Maps-like interface. You might need to
be connected to the internet to use the generated files even if the generated images are stored
locally to access Google APIs.
bin/shadoop plotp <input> <output> shape:<input format>
tilewidth:<tw> tileheight:<th> numlevels:<n> -keep-ratio color:<c> -vflip -overwrite
- input, output - paths to input and output.
- shape:input format - Type of shapes stored in the input file.
Check all available built-in datatypes. You can also provide
the full name of a built-in or a custom datatype.
- tilewidth:tw - width of each generated tile in the output. Default is 256.
- tileheight:th - height of each generated tile in the output. Default is 256.
- numlevels:n - number of zoom levels in the generated image. Default is 7.
- -keep-ratio - directs SpatialHadoop to keep the aspect ratio when generating the image.
This means that the generated image will be of an aspect ratio similar to the MBR of the input file.
To avoid keeping aspect ratio, use -no-keep-ratio instead.
- color:c - the default color used to draw shapes. You can use one of the following
predefined colors: red, pink, blue, cyan, green, black, gray, and orange.
- -vflip - flips the generated image vertically.
- -overwrite - If provided, output file will be overwritten without notice. Otherwise,
the job will fail if output file already exists.