Operations

SpatialHadoop ships with many spatial operations that work efficiently on a cluster and scales up to terabytes of data. This page describes these operations with all possible parameters and customizations. You can list down all supported operations by types bin/shadoop. You can also get a quick description and arguments for a specific operation by typing bin/shadoop <operation>

index

The index command is used to construct a spatial index for a given input file. Currently, there are three spatial indexes which can be constructed in SpatialHadoop, namely, Grid, R-tree and R+-tree.

A grid index partitions the data according to a uniform grid based on the input file size. For uniform data, this index is expected to partition the input file, into equally sized partition, each of size 64MB. If an input record (e.g., rectangle) overlaps multiple partitions, it is replicated to all overlapping partitions. R-tree and R+-tree are both used for skewed input data where a uniform grid does not work properly. In R-tree, each input record is stored in only one partition. The boundaries of each partition may need to expand to enclose all contained records which means that partitions might overlap. On the other side, in R+-tree, partitions are kept disjoint while records are replicated to all overlapping partitions.

      bin/shadoop index <input> <output> shape:<input format>
         sindex:<index> blocksize:<size> -overwrite
    

Arguments

Range query

The rangequery command is used to run a spatial range query for an input file. If the input file is indexed, the index will be accessed to speed up the execution of the query. Otherwise, the whole file will be scanned to retrieve matching records.

      bin/shadoop rangequery <input> <output> shape:<input format>
         rect:<rectangle> -overwrite
    

k Nearest Neighbor

The knn operation is used to run a k nearest neighbor query for an input file. Given a query point Q, this operation returns the k closest records to Q. If the input file is indexed, the index will be accessed to speed up the execution of the query. Otherwise, the whole file will be scanned to retrieve correct answer.

      bin/shadoop knn <input> <output> shape:<input format>
         point:<x,y> k:<k> -overwrite
    

Disributed Join

The dj operation performs a spatial join between two files. If both files are indexed, the distributed join algorithm joins every pair of overlapping partitions which is very scalable and efficient for large files. If at least one of the files is not indexed, the algorithm reduces to a simple block nested loop join which run in a distributed environment in SpatialHadoop.

If both files are indexed, the algorithm might automatically choose to repartition of the files to match the partitions used by the other file which, in some cases, might increase the performance of the operation by reducing total number of map tasks. This option can be overridden by command parameters.

      bin/shadoop dj <input1> <input2> <output> shape:<input format>
         repartition:<yest|no|auto> -overwrite
    

SJMR

SJMR is the MapReduce implementation of the Partition Based Spatial Merge Join by Jignesh Patel and David DeWitt. This operation is designed to perform spatial join efficiently for non-indexed files. This algorithm partitions both files according to a uniform grid and joins every pair of matching grid cells. The dimensions of the partitioning grid are automatically determined based on input file sizes such that average partition size is equal to HDFS block size (e.g., 64MB). If one or both of the input files are skewed, the performance of this algorithm may deteriorate.

      bin/shadoop sjmr <input1> <input2> <output> shape:<input format>
         -overwrite
    

Minimal Bounding Rectangle

The mbr operation determines the minimal bounding rectangle (MBR) of an input file. If the file is indexed, the MBR is retrieved directly from the index. Otherwise, the whole file is scanned and the MBR is calculated as a simple aggregate function.

      bin/shadoop mbr <input> shape:<input format>
    

Read file

Reads an indexed file and writes some simple information about the index.

      bin/shadoop readfile <input>
    

Sample

Reads a random sample from an input file.

      bin/shadoop sample <input> <output>  shape:<input format>
        outshape:<output format>
        (count:<c> | size:<s> | ratio:<r>) seed:<sd>        
    

Spatial Generator

The operations generate is used to generate a file with spatial data of an arbitrary size. This can be very useful for benchmarking and stress tests.

      bin/shadoop generate <output> shape:<output format> mbr:<x1,y1,x2,y2>
        blocksize:<B> sindex:<grid> seed:<sd> rectsize:<rs> -overwrite
    
  • blocksize:B - block size for generated file. The default is the default block size for output path.
  • sindex:grid - generate the output file already indexed using a grid index. Notice that you can only specify a grid index. R-tree indexes can only be constructed after the file is generated because it depends on data distribution.
  • seed:sd - seed to use when generating data. This is useful to replicate some experiments using the same exact data.
  • rectsize:rs - maximum edge size for generated rectangles .The width and height of each generated rectangle are picked uniformly random in the range (0, rs).
  • -overwrite - If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
  • Polygon Union

    Computes the union of a set of polygons in an input file. This operation only works for input files of type JTSShape. It takes one input file and computes the union of all polygons contained in this file.

          bin/shadoop union <input> <output> -overwrite
        

    Plot

    Constructs an image that shows all the data in the input file. The generated file is one PNG image that contains a plot for all the data in the input file.

          bin/shadoop plot <input> <output> shape:<input format>
          width:<w> height:<h> -keep-ratio color:<c> -vflip -overwrite
        

    Plot Pyramid

    Constructs an interactive image for the input file where the user can zoom in/out and pan around the image. You can catch as many details as you want in the generated image by generating more zoom levels. The output consists of a set of images, each of a fixed size (Tile Width x Tile Height) generated at different zoom levels. The output also contains an HTML file which uses Google Maps APIs to navigate through generated images using a Google-Maps-like interface. You might need to be connected to the internet to use the generated files even if the generated images are stored locally to access Google APIs.

          bin/shadoop plotp <input> <output> shape:<input format>
          tilewidth:<tw> tileheight:<th> numlevels:<n> -keep-ratio color:<c> -vflip -overwrite