Operations

SpatialHadoop ships with many spatial operations that work efficiently on a cluster and scales up to terabytes of data. This page describes these operations with all possible parameters and customizations. You can list down all supported operations by types bin/shadoop. You can also get a quick description and arguments for a specific operation by typing bin/shadoop <operation>

index

The index command is used to construct a spatial index for a given input file. Currently, there are three spatial indexes which can be constructed in SpatialHadoop, namely, Grid, R-tree and R+-tree.

A grid index partitions the data according to a uniform grid based on the input file size. For uniform data, this index is expected to partition the input file, into equally sized partition, each of size 64MB. If an input record (e.g., rectangle) overlaps multiple partitions, it is replicated to all overlapping partitions. R-tree and R+-tree are both used for skewed input data where a uniform grid does not work properly. In R-tree, each input record is stored in only one partition. The boundaries of each partition may need to expand to enclose all contained records which means that partitions might overlap. On the other side, in R+-tree, partitions are kept disjoint while records are replicated to all overlapping partitions.

      bin/shadoop index <input> <output> shape:<input format>
         sindex:<index> blocksize:<size> -overwrite

Arguments

input, output - Path to input and output files
shape:input format - Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
sindex:index - Type of index to construct, either grid, rtree or r+tree.
blocksize:size - Block size for generated file. By default, it uses the default block size in the output directory.
-overwrite - If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

Range query

The rangequery command is used to run a spatial range query for an input file. If the input file is indexed, the index will be accessed to speed up the execution of the query. Otherwise, the whole file will be scanned to retrieve matching records.

      bin/shadoop rangequery <input> <output> shape:<input format>
         rect:<rectangle> -overwrite

input, output - paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be discarded which is useful for benchmarking if the actual output is not important.
shape:input format - Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
rect:rectangle
-overwrite - If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

k Nearest Neighbor

The knn operation is used to run a k nearest neighbor query for an input file. Given a query point Q, this operation returns the k closest records to Q. If the input file is indexed, the index will be accessed to speed up the execution of the query. Otherwise, the whole file will be scanned to retrieve correct answer.

      bin/shadoop knn <input> <output> shape:<input format>
         point:<x,y> k:<k> -overwrite

input, output - paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be discarded which is useful for benchmarking if the actual output is not important.
shape:input format - Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
point:x,y
k:k
-overwrite - If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

Disributed Join

The dj operation performs a spatial join between two files. If both files are indexed, the distributed join algorithm joins every pair of overlapping partitions which is very scalable and efficient for large files. If at least one of the files is not indexed, the algorithm reduces to a simple block nested loop join which run in a distributed environment in SpatialHadoop.

If both files are indexed, the algorithm might automatically choose to repartition of the files to match the partitions used by the other file which, in some cases, might increase the performance of the operation by reducing total number of map tasks. This option can be overridden by command parameters.

      bin/shadoop dj <input1> <input2> <output> shape:<input format>
         repartition:<yest|no|auto> -overwrite

input, output - paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be discarded which is useful for benchmarking if the actual output is not important.
shape:input format - Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
repartition:yes|no|auto

yes

no

auto

-overwrite - If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

SJMR

SJMR is the MapReduce implementation of the Partition Based Spatial Merge Join by Jignesh Patel and David DeWitt. This operation is designed to perform spatial join efficiently for non-indexed files. This algorithm partitions both files according to a uniform grid and joins every pair of matching grid cells. The dimensions of the partitioning grid are automatically determined based on input file sizes such that average partition size is equal to HDFS block size (e.g., 64MB). If one or both of the input files are skewed, the performance of this algorithm may deteriorate.

      bin/shadoop sjmr <input1> <input2> <output> shape:<input format>
         -overwrite

input, output - paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be discarded which is useful for benchmarking if the actual output is not important.
shape:input format - Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
-overwrite - If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

Minimal Bounding Rectangle

The mbr operation determines the minimal bounding rectangle (MBR) of an input file. If the file is indexed, the MBR is retrieved directly from the index. Otherwise, the whole file is scanned and the MBR is calculated as a simple aggregate function.

      bin/shadoop mbr <input> shape:<input format>

input - path to input file. No output file is provided as the result is directly written to standard output. If the input is a directory, the output is cached by writing a master file in the same directory.
shape:input format - Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.

Read file

Reads an indexed file and writes some simple information about the index.

      bin/shadoop readfile <input>

input - path to input file. No output file is provided as the result is directly written to standard output.

Sample

Reads a random sample from an input file.

      bin/shadoop sample <input> <output>  shape:<input format>
        outshape:<output format>
        (count:<c> | size:<s> | ratio:<r>) seed:<sd>

input, output - paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution. If output path is not provided, the output will be written to standard output which is useful for debugging or when the output is very small (a few records).
shape:input format - Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
outshape:output format - If the output is desired to be in a different format than input, this parameter can be provided to do a simple transformation on sampled data. The two possible options are rectangle and point. If the output is rectangle, the minimal bounding rectangle (MBR) of input shape is returned. If the output is point, the center point of the MBR of each sampled record is returned.
count:c - read a random sample that contains c records
size:s - read a random sample with a maximum of s bytes
ratio:r - read a random sample with with number of records of the given ratio compared to input file
seed:sd - use the given seed for randomization. This can be useful to replicate some experiments

Spatial Generator

The operations generate is used to generate a file with spatial data of an arbitrary size. This can be very useful for benchmarking and stress tests.

      bin/shadoop generate <output> shape:<output format> mbr:<x1,y1,x2,y2>
        blocksize:<B> sindex:<grid> seed:<sd> rectsize:<rs> -overwrite

input - path to output.
shape:output format - shapes to generate in output file. For now, the only shapes that are supported are point and rectangle.
mbr:x1,y1,x2,y2 - bounding rectangle of the generated file. All generated objects must lie within this bounding box.

blocksize:B - block size for generated file. The default is the default block size for output path.

sindex:grid - generate the output file already indexed using a grid index. Notice that you can only specify a grid index. R-tree indexes can only be constructed after the file is generated because it depends on data distribution.

seed:sd - seed to use when generating data. This is useful to replicate some experiments using the same exact data.

rectsize:rs - maximum edge size for generated rectangles .The width and height of each generated rectangle are picked uniformly random in the range (0, rs).

-overwrite - If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

Polygon Union

Computes the union of a set of polygons in an input file. This operation only works for input files of type JTSShape. It takes one input file and computes the union of all polygons contained in this file.

      bin/shadoop union <input> <output> -overwrite

input, output - paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution.
-overwrite - If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

Plot

Constructs an image that shows all the data in the input file. The generated file is one PNG image that contains a plot for all the data in the input file.

      bin/shadoop plot <input> <output> shape:<input format>
      width:<w> height:<h> -keep-ratio color:<c> -vflip -overwrite

input, output - paths to input and output.
shape:input format - Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
width:w - maximum width of the generated image. Default is 1000
height:h - maximum height of the generated image. Default is 1000
-keep-ratio - directs SpatialHadoop to keep the aspect ratio when generating the image. This means that the generated image will be of an aspect ratio similar to the MBR of the input file. To avoid keeping aspect ratio, use -no-keep-ratio instead.
color:c - the default color used to draw shapes. You can use one of the following predefined colors: red, pink, blue, cyan, green, black, gray, and orange.
-vflip - flips the generated image vertically.
-overwrite - If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

Plot Pyramid

Constructs an interactive image for the input file where the user can zoom in/out and pan around the image. You can catch as many details as you want in the generated image by generating more zoom levels. The output consists of a set of images, each of a fixed size (Tile Width x Tile Height) generated at different zoom levels. The output also contains an HTML file which uses Google Maps APIs to navigate through generated images using a Google-Maps-like interface. You might need to be connected to the internet to use the generated files even if the generated images are stored locally to access Google APIs.

      bin/shadoop plotp <input> <output> shape:<input format>
      tilewidth:<tw> tileheight:<th> numlevels:<n> -keep-ratio color:<c> -vflip -overwrite

input, output - paths to input and output.
shape:input format - Type of shapes stored in the input file. Check all available built-in datatypes. You can also provide the full name of a built-in or a custom datatype.
tilewidth:tw - width of each generated tile in the output. Default is 256.
tileheight:th - height of each generated tile in the output. Default is 256.
numlevels:n - number of zoom levels in the generated image. Default is 7.
-keep-ratio - directs SpatialHadoop to keep the aspect ratio when generating the image. This means that the generated image will be of an aspect ratio similar to the MBR of the input file. To avoid keeping aspect ratio, use -no-keep-ratio instead.
color:c - the default color used to draw shapes. You can use one of the following predefined colors: red, pink, blue, cyan, green, black, gray, and orange.
-vflip - flips the generated image vertically.
-overwrite - If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.