Extensible data types

SpatialHadoop ships with several data types including (Point, Rectangle and Polygon). There are different cases where you'll need to extend these data types or implement new spatial data types.

This tutorial describes in details how current data types are implemented and how to extend it with more spatial data types.

Built-in data types

SpatialHadoop contains three main spatial data types, namely, Point, Rectangle and Polygon. Each data types stores just the spatial information about the shape without any extra information. All shapes are two-dimensional in the Euclidean space. A point is represented by its two dimensions (X and Y). A rectangle is represented by a corner point (X, Y) and the dimensions (Width x Height). A polygon is represented as a list of two-dimensional points.

The main storage format for spatial data types in SpatialHadoop is the text format. This makes it easier to import/export legacy formats in other applications. The standard format is a CSV format where each record is stored in one line. This format can be changed for custom data types provided that each record is stored in exactly one line. For point, a line contains two fields (X,Y) separated by a comma. For a rectangle, the tuple (X, Y, Width, Height) is stored with a comma as a separator. For polygon, each line contains number of points followed by the coordinates of each point. For example, a triangle with the corner points (0,0), (1,1) and (1,0) can be represented as "3,0,0,1,1,1,0". All coordinates used in the standard data types are long integers (64-bit).

User-defined data types

To define your own data type, you need to define it as a new class that implements the Shape interface. For convenience, you could choose to extend one of the standard data types and built on top of it instead of building a class from scratch. For example, let's say that your files contain records represented as rectangles. Unlike the standard rectangles, each rectangle has an additional ID that precedes the coordinates of the rectangle. You can extend the rectangle class and add an additional ID field.

RectangleID.java:

      
public class RectangleID extends Rectangle {
  private int id;
      

The new field must be also written when an object of this class is serialized over network. This is required by SpatialHadoop (and Hadoop) when objects are transferred from mappers to reducers. This can be done as follows.

public void write(DataOutput out) throws IOException {
	out.writeInt(id);
	super.write(out);
}

public void readFields(DataInput in) throws IOException {
	id = in.readInt();
	super.readFields(in);
}
      

You need also to specify the format of the input file that contains objects of this type. This done by implementing two methods fromText and toText. The first method takes as input a text that represents a line read from the input file, and parses it to fill the target object. The second method does the exact opposite of this. It takes a Text object and serializes the information stored in this object to this text. It should not add a terminating new line as this is added by the framework itself. The implementation of these two method will look like this.

public void fromText(Text text) {
	id = TextSerializerHelper.consumeInt(text, ',');
	super.fromText(text);
}

public Text toText(Text text) {
	TextSerializerHelper.serializeInt(id, text, ',');
	return super.toText(text);
}
      

Finally, you need to override the clone method in the new data type.

public RectangleID clone() {
  RectangleID c = new RectangleID();
  c.id = this.id; // Set the id field
  c.set(this); // Set rectangle boundaries
  return c;
}
      

Once you're done with this class, you can use it with the existing operations (range query, kNN and spatial join). You need to package your class in a JAR file so that SpatialHadoop can load it. Let us assume the JAR file is called 'recid.jar', the command line for running a range query could be as follows.
$ bin/shadoop rangequery -libjars recid.jar input-filename rect:500,500,1000,1000 shape:RectangleID
Note that you should use the fully qualified name of the class (i.e., package + class name).