SpatialHadoop is a MapReduce framework designed specifically to work with spatial data. Use it to analyze your huge spatial datasets on a cluster of machines.
Installing and configuring SpatialHadoop
SpatialHadoop is built on Hadoop 1.0.3 and can be installed in much the same way as traditional Hadoop. This tutorial will take you through the process of setting up SpatialHadoop on a single machine and running some examples. The setup can easily be expanded to a cluster of machines by following this tutorial.
- SpatialHadoop is supported mainly under Linux. Windows is partially supported for development but not for production.
Java™ 1.6.x, preferably from Sun, must be installed. We recommend using Java 1.6.0_24.
To check the version of Java installed, type
$ java -version
To install Java, type
$ sudo apt-get install openjdk-6-jre-headless
SSH client and SSH server.
To check if both SSH client and server are installed correctly type
$ ssh localhost echo success
You should see the message 'success' on screen
To install both SSH client and server type
$ sudo apt-get install openssh-client openssh-server
To configure SSH to log in to your local host without a password, type
$ mkdir -p ~/.ssh && touch ~/.ssh/authorized_keys
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
The last command restricts the file's permissions, which the SSH server requires before it will accept the key.
To verify that SSH can log in to your localhost without a password prompt, type
$ ssh localhost echo success
If prompted, accept the fingerprint by typing "yes" and hitting "ENTER".
- For Windows: Cygwin is Required for shell support in addition to the required software above.
Download the latest version of SpatialHadoop here.
Configure on one machine
- Extract the downloaded compressed file into a local directory
Edit conf/hadoop-env.sh in the extracted folder and set JAVA_HOME to your current Java installation.
The default location in Ubuntu installations is:
/usr/lib/jvm/java-6-openjdk/jre -- For 32-bit version
/usr/lib/jvm/java-6-openjdk-amd64/jre -- For 64-bit version
You can locate your Java installation by typing
$ which java
Note that this prints the path of the java executable (often a symlink), not the Java home directory itself.
To set JAVA_HOME, find a commented line that looks like
# export JAVA_HOME=<something>
Uncomment this line, and replace <something> with your Java home path.
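If you are unsure of the exact path, the Java home can be derived from the resolved location of the java binary. A minimal sketch; the sample path below is the 64-bit Ubuntu default mentioned above and stands in for the result of `readlink -f "$(which java)"` on a real machine:

```shell
# Hypothetical sample path: the 64-bit Ubuntu default from above.
# On a real machine, resolve it with: JAVA_BIN=$(readlink -f "$(which java)")
JAVA_BIN=/usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java

# Strip the trailing /bin/java to obtain the Java home directory.
JAVA_HOME=${JAVA_BIN%/bin/java}
echo "$JAVA_HOME"   # prints /usr/lib/jvm/java-6-openjdk-amd64/jre
```

The `${var%pattern}` expansion removes the shortest matching suffix, so it works for any installation path that ends in /bin/java.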
Configure SpatialHadoop by editing the following configuration files
conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
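The three configuration files can also be written from the shell in one pass. A minimal sketch; the `CONF_DIR` variable is an assumption standing in for the conf/ directory of your extracted folder:

```shell
# CONF_DIR is assumed to point at the conf/ directory of the extracted folder.
CONF_DIR=${CONF_DIR:-conf}
mkdir -p "$CONF_DIR"

# Name node address (fs.default.name) goes in core-site.xml.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# Single-machine setup: keep one replica of each block.
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

# Job tracker address for the MapReduce layer.
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
```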
Format Hadoop Distributed File System (HDFS) by typing
$ bin/hadoop namenode -format
Start a local SpatialHadoop cluster by typing
$ bin/start-all.sh
Once you have SpatialHadoop configured correctly, you are ready to run some sample programs. The following steps will generate a random file, index it using a Grid index, and run some spatial queries on the indexed file. The classes needed for this example are all contained in the hadoop-operations-*.jar shipped with the binary release. You can type 'bin/hadoop jar hadoop-operations-*.jar' to get the usage syntax for the available operations.
To generate a file containing random rectangles, enter the following command
$ bin/hadoop jar hadoop-operations-*.jar generate test mbr:0,0,1000000,1000000 size:1.gb shape:rect
This generates a 1GB file named 'test', where all rectangles in the file are contained in the rectangle with corner at (0,0) and dimensions 1Mx1M units.
If you have your own file that needs to be processed, you can upload it the same way you do with traditional Hadoop by typing
$ bin/hadoop fs -copyFromLocal <local file path> <HDFS file path>
Then you can index this file using the following command
To index this file using a global-only Grid index, type
$ bin/hadoop jar hadoop-operations-*.jar index test test.grid mbr:0,0,1000000,1000000 global:grid
To see how the grid index partitions this file, type
$ bin/hadoop jar hadoop-operations-*.jar readfile test.grid
This shows the list of partitions in the file, each defined by its boundaries, along with the number of blocks in each partition.
To run a range query operation on this file, type
$ bin/hadoop jar hadoop-operations-*.jar rangequery test.grid rq_results rect:500,500,1000,1000
This runs a range query over the file with the query range set to the rectangle with corner at (500,500) and dimensions 1000x1000. The results are stored in an HDFS file named 'rq_results'.
To run a k-nearest-neighbor (kNN) query operation on this file, type
$ bin/hadoop jar hadoop-operations-*.jar knn test.grid knn_results point:1000,1000 k:1000
This runs a kNN query where the query point is at (1000,1000) and k=1000. The results are stored in an HDFS file named 'knn_results'.
To run a spatial join operation
First, generate another file and have it globally indexed on the fly using the command
$ bin/hadoop jar hadoop-operations-*.jar generate test2.grid mbr:0,0,1000000,1000000 size:100.mb global:grid
Now, join the two files via the Distributed Join algorithm using the command
$ bin/hadoop jar hadoop-operations-*.jar dj test.grid test2.grid sj_results
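The example steps above can be collected into one script. Since running it requires a configured, running SpatialHadoop cluster, the sketch below only writes the script out and syntax-checks it; the commands themselves are exactly those shown above:

```shell
# Write the full example pipeline to a script. Running it requires a
# configured SpatialHadoop cluster, so here we only create and syntax-check it.
cat > run_example.sh <<'EOF'
#!/bin/sh
# Generate a 1GB file of random rectangles
bin/hadoop jar hadoop-operations-*.jar generate test mbr:0,0,1000000,1000000 size:1.gb shape:rect
# Index it with a global grid index
bin/hadoop jar hadoop-operations-*.jar index test test.grid mbr:0,0,1000000,1000000 global:grid
# Inspect the resulting partitions
bin/hadoop jar hadoop-operations-*.jar readfile test.grid
# Range query
bin/hadoop jar hadoop-operations-*.jar rangequery test.grid rq_results rect:500,500,1000,1000
# kNN query
bin/hadoop jar hadoop-operations-*.jar knn test.grid knn_results point:1000,1000 k:1000
# Generate a second, grid-indexed file and join the two
bin/hadoop jar hadoop-operations-*.jar generate test2.grid mbr:0,0,1000000,1000000 size:100.mb global:grid
bin/hadoop jar hadoop-operations-*.jar dj test.grid test2.grid sj_results
EOF
chmod +x run_example.sh

# -n parses the script without executing it
sh -n run_example.sh && echo "syntax OK"
```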
SpatialHadoop virtual machine
For your convenience, SpatialHadoop is also available as a virtual machine image. This virtual machine has the latest version of SpatialHadoop already installed and configured. Just run it and you are ready to go. The machine was created and tested on VirtualBox version 4.2.6, which is available for free download.
- Download the latest version of VirtualBox here
- Download the latest virtual machine image here and extract it
- Start VirtualBox, choose 'Machine'->'Add Machine' from the drop-down menu, and browse to the extracted machine image.
- Start the virtual machine by clicking the 'Start' button. The username/password are 'shadoop/shadoop'.
- SpatialHadoop is located in '~/hadoop-*' and configured to run in a single-machine setup.