Spatial index

This tutorial describes how to access the spatial indexes built in SpatialHadoop. We assume that there is an index already constructed using the index command and stored on disk. In this tutorial we show how to retrieve basic information about the index and how to perform simple queries on it. Notice that this tutorial does not directly show how to use it in MapReduce programs. Rather, it shows low-level interaction with the index which can be indirectly used in MapReduce programs. For users interested in writing spatial MapReduce operations, please check the operations tutorial

Index Layout

The first step to interact with the index is to understand how it is organized on disk. Let's say we issue the index command to create an R-tree index in the path 'parks.rtree'. The target will be stored as a folder that looks like the one shown below.

The index is stored as a set of data files where each data file contains the records in one partition. The index also contains one _master.xxx file which contains meta data about the global index (i.e., file partitions). Simply, it contains one line per partition which contains the boundaries of the partition and the partition file name. The extension of the master file indicates the type of index constructed. The supported indexes are grid, rtree and r+tree. Below is an example of a small master file with three partitions.

-179.3248215,-54.934357,6.9290401,71.2885321,part-00000_data_00001
-171.7735299,-54.8114255,6.9261512,65.1485099,part-00000_data_00001_1
6.9225032,-46.44586,179.3801209,78.0657531,part-00000_data_00002_2
    

All other files are data files which contain data records. For a grid index, each partition file is a simple text file which contains one record per line. For R-tree and R+-tree, partition files are a little bit more complex. Records in each partition are organized in an R-tree index. Each file contains two sections. The first section stores the R-tree structure in a binary format as a list of nodes stored in level-order traversal. The second section stores data records in a text format as one record per line. The header of the R-tree contains information about the size of the tree which allows SpatialHadoop to skip over the R-tree structure and reads the records directly if the R-tree is not useful for processing.

Access Global Index

All the information about the global index is stored in the master file. However, you don't need to parse the file yourself. You can make an API call which retrieves the global index and returns it as one object. The method SpatialSite#getGlobalIndex(FileSystem, Path) takes a file system and a path to a directory in that file system and returns the associated master file. If the path indicates a non-indexed data, null is returned. The returned value is of type GlobalIndex<Partition>. You can iterate over all partitions using the iterator method. You can also retrieve specific partitions using the rangeQuery and knn methods.

Access Records in a Partition

Once you retrieve a partition or a set of partitions, the next step is to read the records stored in that partition. The format of partition files is different depending on the index type. In grid index, partitions are stored as text files, while in R-tree and R+-tree, partitions organize the data in an R-tree. To simplify the parsing of the partition, the API contains an abstract class SpatialRecordReader which contains all the logic needed for parsing a data file. This class automatically detects the format of the partition and parses it accordingly. In addition to that abstract class, SpatialHadoop also contains a set of concrete classes that retrieves records from the file in a specific format. All of them are instances of RecordReader which allows them to be used in MapReduce programs. To adhere with the MapReduce programming interface, each record has to be represented as a key-value pair. SpatialHadoop always uses the partition boundaries as the key represented as an object of type Rectangle. The value differs according to the specific reader instantiated/used. The following table summarizes all these classes and describes the format of both the key and value returned by them.

ClassValueDescription
ShapeLineRecordReaderText Returns each record in its raw text format without parsing it. Useful to speedup the processing of high level aggregation functions such as count. If the partition is locally indexed using and R-tree, the tree structure is skipped and the lines are read directly
ShapeRecordReaderShape Similar to the ShapeLineRecordReader but adds an extra step of actually parsing the text line into an object of type Shape.
ShapeArrayRecordReaderArrayWritable Instead of returning shapes one by one, this class reads a bunch of records and returns all of them as an object of type ArrayWritable. The maximum number of objects to be returned in one function call can be set in the job configuration spatialHadoop.mapred.MaxShapesPerRead. The default is 1,000,000 objects. Since some records are really big and 1,000,000 records can be too large in this case, there is another configuration (spatialHadoop.mapred.MaxBytesPerRead) that limits the number of bytes read in one function call. The default of the later configuration is 32MB. Setting any of these two parameters to a non-positive number, makes SpatialHadoop skips the corresponding check and increases the threshold to infinity. For example, setting spatialHadoop.mapred.MaxShapesPerRead to -1 while keeping spatialHadoop.mapred.MaxBytesPerRead at 32MB, results in reading each 32MB in one read regardless of number of records returned.
RTreeRecordReaderRTree<Shape> For R-tree indexed partitions, this class returns each partition as one object of type RTree which maps to the R-tree stored in that partition. Note that only the structure of the R-tree structure is read during the next operation for efficiency. Records are retrieved from disk only when requested by user. If this class is used for a grid file, an runtime exception is thrown.