A MapReduce framework for spatial data
A recent addition to SpatialHadoop is CG_Hadoop, a set of computational geometry operations written in MapReduce. This is useful to scale up computational geometry operations for datasets of tera bytes and billions of records. We started with the polygon union operation which computes the union of a set of input polygons. We then made other fundamental computation geometry operations including skyline, convex hull, farthest pair and closest pair. Each operation runs on both indexed and non-indexed files. The version that runs on non-indexed (heap) files is much slower as it needs to scan the whole file. The indexed version is much more efficient as it makes good use of the index to speed up the processing and early prune file partitions that do not need to be processed.
The technical details behind these algorithms are described in our paper in ACM SIGSPATIAL paper titled "CG_Hadoop: Computational Geometry in MapReduce". All the source code is available in our github repository in the operations package.
To use the computational geometry operations in SpatialHadoop, you need to write the correct command for each one. Below, we describe how to use each operation.
Computes the union of a set of polygons in an input file. This operation only works for input files of type JTSShape. It takes one input file and computes the union of all polygons contained in this file.
bin/shadoop union <input> <output> -overwrite
Computes the skyline of a set of points. The skyline is the set of non-dominated points in the input dataset. A point p is said to dominate another point q if p is larger than or equal to q in both x and y dimensions.
bin/shadoop skyline <input> <output> dir:<direction> -overwrite
Computes the convex hull of a set of points. The convex hull is the minimal convex polygon that contains all input points. This operations returns only the subset of points that form the convex hull, i.e., the corner points of the convex hull.
bin/shadoop convexhull <input> <output> -overwrite
As its name states, this method compute the two points with the largest Euclidean distance. Since the two points forming the farthest pair are always two corners on the convex hull, this method assumes that the input points already form a convex hull. If the input is just a set of scattered points, you should call the convex hull operation first. The method does not explicitly call the convex hull operation to avoid unnecessary processing in case the input is already on a convex hull.
bin/shadoop farthestpair <input> <output> -overwrite
This method is the direct opposite of farthest pair. It finds the pair of points that has the minimum Euclidean distance between them.
bin/shadoop closestpair <input> <output> -overwrite