SpatialHadoop Overview
SpatialHadoop is a MapReduce framework designed specifically to work with spatial data. Use it to analyze huge spatial datasets on a cluster of machines.
Installing and configuring SpatialHadoop
SpatialHadoop is built on Hadoop 1.0.3 and can be installed in the same way as traditional Hadoop. This tutorial takes you through the process of setting up SpatialHadoop on a single machine and running some examples. The setup can easily be expanded to a cluster of machines by following this tutorial.
Prerequisites
- SpatialHadoop is supported mainly under Linux. Windows is partially supported for development but not for production.
- Java 1.6.x, preferably from Sun, must be installed. We recommend using Java 1.6.0_24.
- SSH client and SSH server.
- For Windows: Cygwin is required for shell support, in addition to the software listed above.
Download
Download the latest version of SpatialHadoop here.
Configure on one machine
- Extract the downloaded compressed file into a local directory
- Edit conf/hadoop-env.sh in the extracted folder and set JAVA_HOME to your current Java installation.
- Configure SpatialHadoop using the following configuration files:
conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
- Format the Hadoop Distributed File System (HDFS) by typing
$ bin/hadoop namenode -format
- Start a local SpatialHadoop cluster by typing
$ bin/start-all.sh
Usage examples
Once you have SpatialHadoop configured correctly, you are ready to run some sample programs. The following steps will generate a random file, index it using a Grid index, and run some spatial queries on the indexed file. The classes needed for this example are all contained in the hadoop-operations-*.jar shipped with the binary release. You can type 'bin/hadoop jar hadoop-operations-*.jar' to get the usage syntax for the available operations.
To generate a random file containing random rectangles, enter the following command
$ bin/hadoop jar hadoop-operations-*.jar generate test mbr:0,0,1000000,1000000 size:1.gb shape:rect
This generates a 1GB file named 'test', where all rectangles in the file are contained in the rectangle with a corner at (0,0) and dimensions 1M x 1M units.
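To get a feel for what such data looks like, here is a minimal Java sketch that writes random rectangles contained in the same MBR as comma-separated lines (x1,y1,x2,y2). The class name, output file, record count, and edge-length cap are made up for illustration; the real generate operation writes directly to HDFS and targets a file size rather than a record count.

import java.io.PrintWriter;
import java.util.Random;

// Illustrative sketch only: write random rectangles contained in the MBR
// (0,0)-(1000000,1000000) as comma-separated lines "x1,y1,x2,y2".
public class GenerateRectangles {
  public static void main(String[] args) throws Exception {
    final double mbrWidth = 1000000, mbrHeight = 1000000; // mbr:0,0,1000000,1000000
    final int count = 1000;       // number of rectangles (the real tool targets a file size instead)
    final double maxEdge = 1000;  // hypothetical cap on rectangle edge length
    Random rand = new Random();
    try (PrintWriter out = new PrintWriter("test.txt")) {
      for (int i = 0; i < count; i++) {
        double x1 = rand.nextDouble() * (mbrWidth - maxEdge);
        double y1 = rand.nextDouble() * (mbrHeight - maxEdge);
        double x2 = x1 + rand.nextDouble() * maxEdge;  // keeps the rectangle inside the MBR
        double y2 = y1 + rand.nextDouble() * maxEdge;
        out.printf("%f,%f,%f,%f%n", x1, y1, x2, y2);
      }
    }
  }
}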
If you have your own file that needs to be processed, you can upload it the same way you do with traditional Hadoop by typing
$ bin/hadoop fs -copyFromLocal <local file path> <HDFS file path>
Then you can index this file using the following command.
To index this file using a global-only Grid index:
$ bin/hadoop jar hadoop-operations-*.jar index test test.grid mbr:0,0,1000000,1000000 global:grid
To see how the grid index partitions this file, type:
$ bin/hadoop jar hadoop-operations-*.jar readfile test.grid
This shows the list of partitions in the file, each defined by its boundaries, along with the number of blocks in each partition.
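Conceptually, a uniform Grid index lays a grid of cells over the input MBR and assigns each rectangle to every cell it overlaps, so objects on cell boundaries appear in more than one partition. The sketch below illustrates that assignment; the 10x10 grid resolution and the cell-numbering scheme are assumptions for illustration and not necessarily what SpatialHadoop uses.

import java.util.ArrayList;
import java.util.List;

// Sketch of uniform grid partitioning over the MBR (0,0)-(W,H):
// a rectangle is assigned to every grid cell it overlaps.
public class GridPartitioner {
  final double W = 1000000, H = 1000000;   // input MBR
  final int cols = 10, rows = 10;          // hypothetical grid resolution

  // Returns the ids (encoded as col*rows+row) of all cells overlapping
  // the rectangle (x1,y1)-(x2,y2).
  List<Integer> overlappingCells(double x1, double y1, double x2, double y2) {
    double cellW = W / cols, cellH = H / rows;
    int c1 = Math.min(cols - 1, (int) (x1 / cellW)), c2 = Math.min(cols - 1, (int) (x2 / cellW));
    int r1 = Math.min(rows - 1, (int) (y1 / cellH)), r2 = Math.min(rows - 1, (int) (y2 / cellH));
    List<Integer> cells = new ArrayList<>();
    for (int c = c1; c <= c2; c++)
      for (int r = r1; r <= r2; r++)
        cells.add(c * rows + r);
    return cells;
  }
}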
To run a range query operation on this file
$ bin/hadoop jar hadoop-operations-*.jar rangequery test.grid rq_results rect:500,500,1000,1000
This runs a range query over the file with the query range set to the rectangle at (500,500) with dimensions 1000x1000. The results are stored in an HDFS file named 'rq_results'.
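At its core, a range query prunes partitions whose boundaries do not intersect the query rectangle and then filters the records of the remaining partitions with the same overlap test. Below is a minimal sketch of that predicate using the query from the example; it is illustrative only and does not use SpatialHadoop's classes.

// Overlap test used (conceptually) both to prune grid partitions and to filter records.
// Query from the example: rect:500,500,1000,1000 -> corner (500,500), dimensions 1000x1000.
public class RangeQuerySketch {
  static boolean overlaps(double ax1, double ay1, double ax2, double ay2,
                          double bx1, double by1, double bx2, double by2) {
    return ax1 < bx2 && bx1 < ax2 && ay1 < by2 && by1 < ay2;
  }

  public static void main(String[] args) {
    double qx1 = 500, qy1 = 500, qx2 = qx1 + 1000, qy2 = qy1 + 1000;
    // A rectangle overlapping the query range is reported...
    System.out.println(overlaps(qx1, qy1, qx2, qy2, 1400, 1400, 1600, 1600)); // true
    // ...one outside it is not.
    System.out.println(overlaps(qx1, qy1, qx2, qy2, 2000, 2000, 2100, 2100)); // false
  }
}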
To run a knn query operation on this file
$ bin/hadoop jar hadoop-operations-*.jar knn test.grid knn_results point:1000,1000 k:1000
This runs a kNN query where the query point is at (1000,1000) and k=1000. The results are stored in the HDFS file 'knn_results'.
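The kNN answer can be maintained with a bounded max-heap: keep the k closest points seen so far and evict the farthest whenever a closer one arrives. The sketch below shows that idea for points in the plane; the class and method names are invented for illustration, and the actual MapReduce operation works over indexed partitions rather than a plain in-memory list.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the kNN idea: keep the k closest points to the query point in a
// max-heap ordered by distance, so the farthest candidate can be evicted.
public class KnnSketch {
  static List<double[]> knn(List<double[]> points, double qx, double qy, int k) {
    PriorityQueue<double[]> heap = new PriorityQueue<>(
        Comparator.comparingDouble((double[] p) -> dist(p, qx, qy)).reversed());
    for (double[] p : points) {
      heap.add(p);
      if (heap.size() > k) heap.poll();  // drop the farthest of the k+1 candidates
    }
    return new ArrayList<>(heap);
  }

  static double dist(double[] p, double qx, double qy) {
    double dx = p[0] - qx, dy = p[1] - qy;
    return Math.sqrt(dx * dx + dy * dy);
  }
}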
To run a spatial join operation
First, generate another file and have it globally indexed on the fly using the command
$ bin/hadoop jar hadoop-operations-*.jar generate test2.grid mbr:0,0,1000000,1000000 size:100.mb global:grid
Now, join the two files via the Distributed Join algorithm using the command
$ bin/hadoop jar hadoop-operations-*.jar dj test.grid test2.grid sj_results
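Roughly speaking, a distributed join matches pairs of overlapping partitions from the two indexed files and then joins the rectangles inside each pair, for example with a plane-sweep. The following Java sketch shows only an in-memory plane-sweep step under that assumption; it is not SpatialHadoop's actual implementation.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of an in-memory plane-sweep join between two lists of rectangles
// (each rectangle as {x1, y1, x2, y2}): sort both lists by x1 and only compare
// rectangles whose x-ranges can still overlap.
public class PlaneSweepJoin {
  static List<double[][]> join(List<double[]> r, List<double[]> s) {
    Comparator<double[]> byX1 = Comparator.comparingDouble(a -> a[0]);
    r.sort(byX1);
    s.sort(byX1);
    List<double[][]> results = new ArrayList<>();
    int i = 0, j = 0;
    while (i < r.size() && j < s.size()) {
      if (r.get(i)[0] <= s.get(j)[0]) {
        // Sweep s against r.get(i) until s starts past the right edge of r.get(i)
        for (int jj = j; jj < s.size() && s.get(jj)[0] <= r.get(i)[2]; jj++)
          if (overlaps(r.get(i), s.get(jj))) results.add(new double[][]{r.get(i), s.get(jj)});
        i++;
      } else {
        for (int ii = i; ii < r.size() && r.get(ii)[0] <= s.get(j)[2]; ii++)
          if (overlaps(r.get(ii), s.get(j))) results.add(new double[][]{r.get(ii), s.get(j)});
        j++;
      }
    }
    return results;
  }

  static boolean overlaps(double[] a, double[] b) {
    return a[0] < b[2] && b[0] < a[2] && a[1] < b[3] && b[1] < a[3];
  }
}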
SpatialHadoop virtual machine
For your convenience, SpatialHadoop is also available as a virtual machine image. This virtual machine has the latest version of SpatialHadoop already installed and configured. Just run it and you are ready to go. The machine was created and tested on VirtualBox version 4.2.6, which is available for free download.
- Download the latest version of VirtualBox here
- Download the latest virtual machine image here and extract it
- Start VirtualBox, choose from the drop down menus 'Machine'->'Add Machine', and browse to the extracted machine image.
- Start the virtual machine by clicking the 'Start' button. The username/password are 'shadoop/shadoop'.
- SpatialHadoop is located in '~/hadoop-*' and configured to run in a single-machine setup.
All downloads
- Source code
- Binary distribution
- Amazon EC2 Image (Coming soon)