A MapReduce framework for spatial data

A recent addition to SpatialHadoop is CG_Hadoop, a set of computational geometry operations written in MapReduce. This is useful to scale up computational geometry operations for datasets of tera bytes and billions of records. We started with the polygon union operation which computes the union of a set of input polygons. We then made other fundamental computation geometry operations including skyline, convex hull, farthest pair and closest pair. Each operation runs on both indexed and non-indexed files. The version that runs on non-indexed (heap) files is much slower as it needs to scan the whole file. The indexed version is much more efficient as it makes good use of the index to speed up the processing and early prune file partitions that do not need to be processed.

The technical details behind these algorithms are described in our paper in ACM SIGSPATIAL paper titled "CG_Hadoop: Computational Geometry in MapReduce". All the source code is available in our github repository in the operations package.

To use the computational geometry operations in SpatialHadoop, you need to write the correct command for each one. Below, we describe how to use each operation.

Computes the union of a set of polygons in an input file. This operation only works for input files of type JTSShape. It takes one input file and computes the union of all polygons contained in this file.

bin/shadoop union <input> <output> -overwrite

**input, output**- paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution.**-overwrite**- If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

Computes the skyline of a set of points. The skyline is the set of *non-dominated* points in the input dataset.
A point *p* is said to dominate another point *q* if *p* is larger than or equal to *q* in both *x* and *y* dimensions.

bin/shadoop skyline <input> <output> dir:<direction> -overwrite

**input, output**- paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution.**dir:rection**- the direction to compute the skyline. Valid directions are max-max, max-min, min-max, and min-min. For example, max-min means that a point*p*dominates another point*q*if it is larger in*x*and smaller in*y*.**-overwrite**- If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

Computes the convex hull of a set of points. The convex hull is the minimal convex polygon that contains all input points. This operations returns only the subset of points that form the convex hull, i.e., the corner points of the convex hull.

bin/shadoop convexhull <input> <output> -overwrite

**input, output**- paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution.**-overwrite**- If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

As its name states, this method compute the two points with the largest Euclidean distance. Since the two points forming the farthest pair are always two corners on the convex hull, this method assumes that the input points already form a convex hull. If the input is just a set of scattered points, you should call the convex hull operation first. The method does not explicitly call the convex hull operation to avoid unnecessary processing in case the input is already on a convex hull.

bin/shadoop farthestpair <input> <output> -overwrite

**input, output**- paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution.**-overwrite**- If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

This method is the direct opposite of farthest pair. It finds the
pair of points that has the *minimum* Euclidean distance between them.

bin/shadoop closestpair <input> <output> -overwrite

**input, output**- paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution.**-overwrite**- If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.