SpatialHadoop

SpatialHadoop Overview

SpatialHadoop is a MapReduce framework designed specially to work with spatial data. Use it to analyze your huge spatial datasets on a cluster of machines.

Installing and configuring SpatialHadoop

SpatialHadoop is built on Hadoop 1.0.3 and can be installed just the same way you do with the traditional Hadoop. This tutorial will take you through the process of setting up SpatialHadoop on a single machine and running some examples. It can be easily expanded to a cluser of machines by following this tutorial.

Prerequisites

SpatialHadoop is supported mainly under Linux. Windows is partially supported for development but not for production.
Java^TM 1.6.x, preferably from Sun, must be installed. We recommend using Java 1.6.0_24. ?

To check the version of Java installed type
$ java -version

To install java type
$ sudo apt-get install openjdk-6-jre-headless
SSH client and SSH server. ?

To check if both SSH client and server are installed correctly type
$ ssh localhost echo success
You should see the message 'success' on screen

To install both SSH client and server type
$ sudo apt-get install openssh-client openssh-server

To configure SSH to login to your local host with no password type
$ mkdir ~/.ssh && touch ~/.ssh/authorized_key
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

To allow SSH to login to your localhost without prompt type
$ ssh localhost echo success
If prompted, accept the fingerprint by typing "yes" and hiting "ENTER".
For Windows: Cygwin is Required for shell support in addition to the required software above.

Download

Download the latest version of SpatialHadoop here.

Configure on one machine

Extract the downloaded compressed file into a local directory
Edit conf/hadoop-env.sh in the extracted folder and set JAVA_HOME to your current Java installation. ?

The default location in Ubuntu installations is:
/usr/lib/jvm/java-6-openjdk/jre -- For 32-bit version
/usr/lib/jvm/java-6-openjdk-amd64/jre -- For 64-bit version
You can lookup your java home directory by typing
$ which java

To set JAVA_HOME, find a commented line that looks like
# export JAVA_HOME=<something>
Uncomment this line, and replace <something> with your Java home path.

Configure SpatialHadoop by using the following configuration files
conf/core-site.xml:

<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>

conf/hdfs-site.xml:

<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>

conf/mapred-site.xml:

<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
</configuration>

Format Hadoop Distributed File System (HDFS) by typing
$ bin/hadoop namenode -format
Start a local SpatialHadoop cluster by typing
$ bin/start-all.sh

Usage examples

Once you have SpatialHadoop configured correctly, you are ready to run some sample programs. The following steps will generate a random file, index it using a Grid index, and run some spatial queries on the indexed file. The classes needed for this example are all contained in the hadoop-operations-*.jar shipped with the binary release. You can type 'bin/hadoop jar hadoop-operations-*.jar' to get the usage syntax for the available operations.

To generate a random file containing random rectangles, enter the following command

$ bin/hadoop jar hadoop-operations-*.jar generate test mbr:0,0,1000000,1000000 size:1.gb shape:rect

This generates a 1GB file named 'test', where all rectangles in the file are contained in the rectangle with corner at (0,0) and dimensions 1Mx1M units.

If you have your own file that needs to be processed, you can upload it the same way you do with traditional Hadoop by typing

$ bin/hadoop fs -copyFromLocal <local file path> <HDFS file path>

Then you can index this file using the following command

To index this file using a global-only Grid index

$ bin/hadoop jar hadoop-operations-*.jar index test test.grid mbr:0,0,1000000,1000000 global:grid

To see how the grid index partitions this file, type:

$ bin/hadoop jar hadoop-operations-*.jar readfile test.grid

This shows the list of partitions in file, each defined by boundaries, along with the number of blocks in each partition.

To run a range query operation on this file

$ bin/hadoop jar hadoop-operations-*.jar rangequery test.grid rq_results rect:500,500,1000,1000

This runs a range query over this file with the query range set to the rectangle at (500,500) with dimensions 1000x1000. The results will be stored in an HDFS file named 'rq_result'

To run a knn query operation on this file

$ bin/hadoop jar hadoop-operations-*.jar knn test.grid knn_results point:1000,1000 k:1000

This runs a knn query where the query point is at (1000,1000) and k=1000. The results are stored in HDFS file 'knn_results'

To run a spatial join operation

First, generate another file and have it globally indexed on the fly using the command

$ bin/hadoop jar hadoop-operations-*.jar generate test2.grid mbr:0,0,1000000,1000000 size:100.mb global:grid

Now, join the two files via the Distributed Join algorithm using the command

$ bin/hadoop jar hadoop-operations-*.jar dj test.grid test2.grid sj_results

SpatialHadoop virtual machine

For your convenince, SpatialHadoop is also available as a virtual machine image. This virtual machine has the latest version of SpatialHadoop already installed and configured. Just run it and you are ready to go. This machine was created and tested on VirtualBox version-4.2.6 available for free download.

Download the latest version of VirtualBox here
Download the latest virtual machine image here and extract it
Start VirtualBox, choose from the drop down menus 'Machine'->'Add Machine', and browse to the extracted machine image.
Start the virtual machine by clicking the 'Start' button. The username/password are 'shadoop/shadoop'.
SpatialHadoop is located in '~/hadoop-*' and configured to run in a single-machine setup.

All downloads

Source code
Binary distribution
Amazon EC2 Image (Coming soon)

SpatialHadoop

Analyze your spatial data efficiently

Spatial language

Spatial data types

Spatial indexes

Spatial operations