SpatialHadoop is a MapReduce extension to Apache Hadoop designed specifically to work with spatial data. Use it to analyze huge spatial datasets on a cluster of machines.
What is new
- The source code is restructured to use Maven instead of Ant. Ant is still supported for backward compatibility but will be discontinued soon.
- Added unit tests for increased code robustness
- SpatialHadoop is now available on the Maven Central repository for better integration with other projects
- Hadoop 2.x is now the standard Hadoop version to use. Hadoop 1.x is supported in a different branch and will be discontinued in the future.
- An Uber JAR is now available on the website for easy use with an existing Hadoop cluster, without any setup or installation steps
- General bug fixes and performance improvement, especially in the visualization package
- A renovated visualization layer for generic and extensible visualization of big data.
- Improved implementations for single level and multilevel image visualization.
- Migrated basic operations to the new MapReduce layer for better compatibility with other systems.
- Integrated range query into the input format and record reader for better reuse across operations.
- General bug fixes and performance improvement
- Major fixes for increased stability
- Use GenericOptionsParser in all operations for more flexibility and easier customization
- A new experimental option '-fast' in Plot command which is expected to run faster
- Plot and range query commands are handled faster when running on a small region on indexed files
- Better handling of boundary cases when indexing points
- Fix some bugs with the visualizer interface which is accessible at '/visualizer.jsp'
- Added support for user-defined shape classes in external JAR files
Many thanks to all our users who helped us test and debug SpatialHadoop. Special thanks to:
- Reem Ali
- Faraz Rasheed
- Preliminary support for HDF files in MODIS datasets provided by NASA.
- SpatialHadoop operations are now easier to access through a new bin/shadoop command
- Several bug fixes in visualization operations
Version 2 beta 2
- An all new PlotPyramid operation that plots a set of tiles ready to be visualized using an engine similar to Google Maps. Users can control the size of each tile as well as the number of zoom levels generated.
Version 2 beta 0
- Version 2 of SpatialHadoop is a remodel of the system that is much more flexible and powerful. Most importantly, this version ships mainly as an extension which can be plugged into an existing Hadoop cluster. This means you can keep using your own Hadoop cluster and just extend it with spatial capabilities by installing SpatialHadoop on it. You still get the orders of magnitude performance improvement because SpatialHadoop classes work with the core Hadoop allowing it to efficiently process spatial data.
SpatialHadoop v2 also ships with many new features and improvements.
- Basic data types now support double precision numbers for coordinates
- Added support for OGC standard data types and operations such as Linestring and MultiPolygon
- SpatialHadoop can now run in standalone mode making it easier than before to debug
- New operations added including convex hull, farthest/closest pair and skyline
- Several performance improvements and bug fixes for existing operations
Installing and configuring SpatialHadoop
SpatialHadoop is designed in a generic way that allows it to run on any configured Hadoop cluster. SpatialHadoop was tested on Apache Hadoop 1.2.1 but it should run seamlessly on other distributions of Hadoop as well. First, you need to set up a traditional Hadoop installation. Once Hadoop is configured, you can install SpatialHadoop on it, which adds the new classes and configuration files that make the new commands available. This tutorial takes you through the process of setting up SpatialHadoop on a single machine and running some examples. The setup can easily be expanded to a cluster of machines.
- This tutorial sets up SpatialHadoop on Linux. If you want to run SpatialHadoop on Windows, you can try Hortonworks.
Java™ 1.6.x, preferably from Sun, must be installed. We recommend using Java 1.6.0_24.
To check the version of Java installed, type
$ java -version
To install Java, type
$ sudo apt-get install openjdk-6-jre-headless
Download the latest version of SpatialHadoop here.
Configure on one machine
- Download and extract your preferred version of Hadoop. We tested the latest version of SpatialHadoop on Apache Hadoop 2.7.2.
- Extract the downloaded compressed file into the home directory of Hadoop.
Edit etc/hadoop/hadoop-env.sh in your Hadoop home directory and set JAVA_HOME to your current Java installation.
The default location in Ubuntu installations is:
/usr/lib/jvm/java-6-openjdk/jre (32-bit version)
/usr/lib/jvm/java-6-openjdk-amd64/jre (64-bit version)
You can look up your Java home directory by typing
$ which java
To set JAVA_HOME, find a commented line that looks like
# export JAVA_HOME=<something>
Uncomment this line, and replace <something> with your Java home path.
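After uncommenting, the line should look something like the following. This is only a sketch: the path shown is the 64-bit Ubuntu example listed above, so substitute your own Java home directory.

```shell
# In etc/hadoop/hadoop-env.sh -- replace the path with your Java home
# directory (the 64-bit Ubuntu default is used here as an example).
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64/jre
```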
Once you have SpatialHadoop configured correctly, you are ready to run some sample programs. The following steps will generate a random file, index it using a Grid index, and run some spatial queries on the indexed file. The classes needed for this example are all contained in the spatialhadoop*.jar shipped with the binary release. You can type 'bin/shadoop' to get the usage syntax for the available operations.
To generate a random file containing random rectangles, enter the following command
$ bin/shadoop generate test mbr:0,0,1000000,1000000 size:1.gb shape:rect
This generates a 1GB file named 'test', where all rectangles in the file are contained in the rectangle with corner at (0,0) and dimensions 1Mx1M units.
If you have your own file that needs to be processed, you can upload it the same way you do with traditional Hadoop by typing
$ bin/hadoop fs -copyFromLocal <local file path> <HDFS file path>
To index this file using a Grid index, type
$ bin/shadoop index test test.grid mbr:0,0,1000000,1000000 sindex:grid shape:rect
To see how the grid index partitions this file, type:
$ bin/shadoop readfile test.grid
This shows the list of partitions in the file, each defined by its boundaries, along with the number of blocks in each partition.
To run a range query operation on this file
$ bin/shadoop rangequery test.grid rq_results rect:500,500,1000,1000 shape:rect
This runs a range query over this file with the query range set to the rectangle at (500,500) with dimensions 1000x1000. The results are stored in an HDFS file named 'rq_results'.
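The matching criterion behind a range query is a plain rectangle-overlap test. The following is a toy sketch of that test, not SpatialHadoop code; the `overlaps` helper and the x,y,width,height argument order are illustrative assumptions matching the `rect:` syntax above, and whether boundary-touching rectangles count as matches is an implementation detail.

```shell
# Toy sketch (NOT SpatialHadoop code): does a data rectangle overlap the
# query range rect:500,500,1000,1000 used in the command above?
# Rectangles are given as: x y width height.
qx=500; qy=500; qw=1000; qh=1000
overlaps() {
  local x=$1 y=$2 w=$3 h=$4
  # Two rectangles overlap when each one starts before the other ends,
  # on both the x axis and the y axis.
  [ "$x" -lt $((qx + qw)) ] && [ "$qx" -lt $((x + w)) ] && \
  [ "$y" -lt $((qy + qh)) ] && [ "$qy" -lt $((y + h)) ]
}
overlaps 1400 1400 200 200 && echo overlap || echo disjoint   # prints "overlap"
```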
To run a knn query operation on this file
$ bin/shadoop knn test.grid knn_results point:1000,1000 k:1000 shape:rect
This runs a knn query where the query point is at (1000,1000) and k=1000. The results are stored in the HDFS file 'knn_results'.
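A knn query ranks candidates by their distance to the query point. The snippet below is a toy illustration of that ranking criterion, not SpatialHadoop code; the `dist_to_query` helper is a hypothetical name, and Euclidean distance is assumed.

```shell
# Toy illustration (NOT SpatialHadoop code): Euclidean distance to the
# query point (1000,1000) used in the knn command above. The k results
# with the smallest such distance would be returned.
dist_to_query() {
  awk -v x="$1" -v y="$2" 'BEGIN { printf "%.1f\n", sqrt((x-1000)^2 + (y-1000)^2) }'
}
dist_to_query 1300 1400   # prints 500.0
```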
To run a spatial join operation
First, generate another file and have it indexed on the fly using the command
$ bin/shadoop generate test2.grid mbr:0,0,1000000,1000000 size:100.mb sindex:grid shape:rect
Now, join the two files via the Distributed Join algorithm using the command
$ bin/shadoop dj test.grid test2.grid shape:rect sj_results
SpatialHadoop virtual machine
For your convenience, SpatialHadoop is also available as a virtual machine image. This virtual machine has the latest version of SpatialHadoop already installed and configured. Just run it and you are ready to go. This machine was created and tested on VirtualBox version 4.3.6, which is available for free download.
- Download the latest version of VirtualBox here
- Download the latest virtual machine image
- Start VirtualBox, choose from the drop down menus 'File'->'Import Appliance', and browse to the downloaded machine image.
- Start the virtual machine by clicking the 'Start' button. The username/password combination is 'shadoop/shadoop'.
- SpatialHadoop is located in '~/spatialhadoop-*' and configured to run in a pseudo-distributed mode.
SpatialHadoop v2 is available in several formats. If you need to run it on your existing cluster without losing your current setup, you can download the binary version, which ships as an extension to your Hadoop installation. If you just want to play around with SpatialHadoop and test its features, you can download the virtual machine image, which is already configured and ready to use. This virtual machine also ships with the development environment already set up, which makes it easier for you to add your own code to SpatialHadoop or implement new data types or operations. Finally, you can use SpatialHadoop with Amazon EC2 or EMR, which enable you to quickly start a SpatialHadoop cluster in the cloud.
Find here the list of all available downloads for SpatialHadoop:
- Source code
- SpatialHadoop as an extension to any standard Hadoop distribution based on Apache Hadoop 2.x
- Uber JAR which can run right away on any standard Hadoop distribution
- Virtual machine image
- Amazon EC2
SpatialHadoop v2 Beta 2
The previous releases of SpatialHadoop are still available for your convenience.