SpatialHadoop is a MapReduce extension to Apache Hadoop designed specifically to work with spatial data. Use it to analyze large spatial datasets on a cluster of machines.
What is new
- Preliminary support for HDF files in the MODIS datasets provided by NASA.
- SpatialHadoop operations are now easier to access through a new bin/shadoop command.
- Several bug fixes in visualization operations
Version 2 beta 2
- An all new PlotPyramid operation that plots a set of tiles ready to be visualized using an engine similar to Google Maps. Users can control the size of each tile as well as the number of zoom levels generated.
Version 2 beta 0
- Version 2 of SpatialHadoop is a redesign of the system that is much more flexible and powerful. Most importantly, this version ships mainly as an extension that can be plugged into an existing Hadoop cluster. This means you can keep using your own Hadoop cluster and extend it with spatial capabilities by installing SpatialHadoop on it. You still get the orders-of-magnitude performance improvement because SpatialHadoop classes work with the Hadoop core, allowing it to process spatial data efficiently.
SpatialHadoop v2 also ships with many new features and improvements.
- Basic data types now support double precision numbers for coordinates
- Added support for OGC standard data types and operations such as Linestring and MultiPolygon
- SpatialHadoop can now run in standalone mode making it easier than before to debug
- New operations added including convex hull, farthest/closest pair and skyline
- Several performance improvements and bug fixes for existing operations
Installing and configuring SpatialHadoop
SpatialHadoop is designed in a generic way that allows it to run on any configured Hadoop cluster. SpatialHadoop was tested on Apache Hadoop 1.2.1, but it should also run on other distributions of Hadoop. First, you need to set up traditional Hadoop. Once Hadoop is configured, you can install SpatialHadoop on that distribution, which adds the new classes and configuration files to the cluster so that the new commands can be used. This tutorial takes you through the process of setting up SpatialHadoop on a single machine and running some examples. It can be easily expanded to a cluster of machines by following this tutorial.
- This tutorial sets up SpatialHadoop on Linux. If you want to run SpatialHadoop on Windows, you can try Hortonworks.
Java 1.6.x, preferably from Sun, must be installed. We recommend using Java 1.6.0_24.
To check the installed version of Java, type
$ java -version
To install Java, type
$ sudo apt-get install openjdk-6-jre-headless
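The version check above can also be scripted. The following is a minimal sketch, assuming the first line printed by `java -version` has the usual form `java version "1.6.0_24"`; the helper name `parse_java_version` is hypothetical and not part of SpatialHadoop.

```shell
# parse_java_version is a hypothetical helper (not part of SpatialHadoop).
# It extracts the "major.minor" part, e.g. 1.6, from a version line.
parse_java_version() {
  # $1: a line such as: java version "1.6.0_24"
  printf '%s\n' "$1" | sed -n 's/.*version "\([0-9]*\.[0-9]*\).*/\1/p'
}

# Example with a literal sample line (assumed format).
parse_java_version 'java version "1.6.0_24"'
```

On a real machine, note that `java -version` writes to standard error, so the line would be captured with something like `java -version 2>&1 | head -n 1` before being passed to the helper.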
Download the latest version of SpatialHadoop here.
Configure on one machine
- Extract the downloaded compressed file into a local directory
Edit conf/hadoop-env.sh in the extracted folder and set JAVA_HOME to your current Java installation.
The default location in Ubuntu installations is:
/usr/lib/jvm/java-6-openjdk/jre (32-bit version)
/usr/lib/jvm/java-6-openjdk-amd64/jre (64-bit version)
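You can locate the java executable by typing the command below. It prints the path of the executable, which is often a symbolic link; the Java home is the directory that contains the bin folder of the actual installation.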
$ which java
To set JAVA_HOME, find a commented line that looks like
# export JAVA_HOME=<something>
Uncomment this line, and replace <something> with your Java home path.
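As a rough illustration of how the Java home relates to the location of the java executable, the sketch below strips the trailing /bin/java from a fully resolved path. The helper name `java_home_from_bin` is hypothetical, and the sample path is simply the 64-bit Ubuntu default mentioned above.

```shell
# java_home_from_bin is a hypothetical helper (not part of SpatialHadoop).
# It removes the trailing /bin/java from a resolved executable path.
java_home_from_bin() {
  # $1: fully resolved path to the java executable
  printf '%s\n' "${1%/bin/java}"
}

# Example using the 64-bit Ubuntu default location as a sample input.
java_home_from_bin /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java
```

On a system with GNU coreutils, the resolved path could be obtained with `readlink -f "$(which java)"`, assuming java is on the PATH. The printed directory is the value to place after `export JAVA_HOME=` in conf/hadoop-env.sh.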
Once you have SpatialHadoop configured correctly, you are ready to run some sample programs. The following steps will generate a random file, index it using a Grid index, and run some spatial queries on the indexed file. The classes needed for this example are all contained in the spatialhadoop*.jar shipped with the binary release. You can type 'bin/shadoop' to get the usage syntax for the available operations.
To generate a random file containing random rectangles, enter the following command
$ bin/shadoop generate test mbr:0,0,1000000,1000000 size:1.gb shape:rect
This generates a 1GB file named 'test', where all rectangles in the file are contained in the rectangle with corner at (0,0) and dimensions 1Mx1M units.
If you have your own file that needs to be processed, you can upload it the same way you do with traditional Hadoop by typing
$ bin/hadoop fs -copyFromLocal <local file path> <HDFS file path>
Then you can index the file using a Grid index with the following command
$ bin/shadoop index test test.grid mbr:0,0,1000000,1000000 sindex:grid
To see how the grid index partitions this file, type:
$ bin/shadoop readfile test.grid
This shows the list of partitions in the file, each defined by its boundaries, along with the number of blocks in each partition.
To run a range query operation on this file
$ bin/shadoop rangequery test.grid rq_results rect:500,500,1000,1000
This runs a range query over this file with the query range set to the rectangle at (500,500) with dimensions 1000x1000. The results will be stored in an HDFS file named 'rq_results'.
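To make the extent of such a query range concrete, the sketch below computes the corner opposite the origin of the rectangle. It assumes the x,y,width,height reading that the description above gives and integer coordinates; `rect_far_corner` is a hypothetical helper, not a SpatialHadoop command.

```shell
# rect_far_corner is a hypothetical helper (not part of SpatialHadoop).
# Assumption: the four values mean x,y,width,height, as the text describes.
rect_far_corner() {
  # $1: "x,y,width,height" with integer values
  IFS=, read -r x y w h <<EOF
$1
EOF
  echo "$((x + w)),$((y + h))"
}

# Opposite corner of the query range used in the rangequery example.
rect_far_corner 500,500,1000,1000
```

Under that assumption, the query range spans from (500,500) to (1500,1500).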
To run a knn query operation on this file
$ bin/shadoop knn test.grid knn_results point:1000,1000 k:1000
This runs a knn query where the query point is at (1000,1000) and k=1000. The results are stored in HDFS file 'knn_results'
To run a spatial join operation
First, generate another file and have it indexed on the fly using the command
$ bin/shadoop generate test2.grid mbr:0,0,1000000,1000000 size:100.mb sindex:grid
Now, join the two files via the Distributed Join algorithm using the command
$ bin/shadoop dj test.grid test2.grid sj_results
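The steps in this walkthrough can be collected into one script. The following is a sketch only: it reuses the exact commands shown above, and by default it prints them instead of running them. The names `SHADOOP`, `DRY_RUN`, `run_step`, and `walkthrough` are illustrative and not part of SpatialHadoop.

```shell
# Illustrative wrapper around the commands from this walkthrough.
# SHADOOP, DRY_RUN, run_step and walkthrough are hypothetical names.
SHADOOP=${SHADOOP:-bin/shadoop}
DRY_RUN=${DRY_RUN:-1}

run_step() {
  if [ "$DRY_RUN" = "1" ]; then
    # Dry run: only print the command that would be executed.
    echo "$SHADOOP $*"
  else
    "$SHADOOP" "$@"
  fi
}

walkthrough() {
  # Stop at the first step that fails.
  run_step generate test mbr:0,0,1000000,1000000 size:1.gb shape:rect &&
  run_step index test test.grid mbr:0,0,1000000,1000000 sindex:grid &&
  run_step rangequery test.grid rq_results rect:500,500,1000,1000 &&
  run_step knn test.grid knn_results point:1000,1000 k:1000 &&
  run_step generate test2.grid mbr:0,0,1000000,1000000 size:100.mb sindex:grid &&
  run_step dj test.grid test2.grid sj_results
}

walkthrough
```

Setting DRY_RUN=0 before sourcing the script would execute the commands for real, which assumes it is run from the SpatialHadoop installation directory so that bin/shadoop resolves.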
SpatialHadoop virtual machine
For your convenience, SpatialHadoop is also available as a virtual machine image. This virtual machine has the latest version of SpatialHadoop already installed and configured. Just run it and you are ready to go. This machine was created and tested on VirtualBox version 4.3.6, which is available as a free download.
- Download the latest version of VirtualBox here
- Download the latest virtual machine image
- Start VirtualBox, choose from the drop down menus 'File'->'Import Appliance', and browse to the downloaded machine image.
- Start the virtual machine by clicking the 'Start' button. The username/password combination is 'shadoop/shadoop'.
- SpatialHadoop is located in '~/spatialhadoop-*' and configured to run in a pseudo-distributed mode.
SpatialHadoop v2 is available in several formats. If you need to run it on your existing cluster without losing your current setup, you can download the binary version, which ships as an extension. If you don't have a Hadoop cluster already running, you can download the version shipped with Apache Hadoop, which can be set up exactly like a traditional Hadoop cluster. If you just want to play around with SpatialHadoop and test its features, you probably want to download the virtual machine image (coming soon), which is already configured and ready to use. There is also a development virtual machine (coming soon) that you can use to develop your own programs in SpatialHadoop. Finally, there is an Amazon EC2 image (coming soon) that can be used to quickly set up a SpatialHadoop cluster in the cloud.
Find here the list of all available downloads for SpatialHadoop.
- Source code
- SpatialHadoop as an extension
- SpatialHadoop shipped with Apache Hadoop 1.2.1
- Virtual machine image
- Amazon EC2
SpatialHadoop v2 Beta 2
The previous releases of SpatialHadoop are still available for your convenience.