• SpatialHadoop

Analyze your spatial data efficiently

SpatialHadoop is an open source MapReduce extension designed specifically to handle huge datasets of spatial data on Apache Hadoop. SpatialHadoop is shipped with built-in spatial high level language, spatial data types, spatial indexes and efficient spatial operations.



SpatialHadoop Overview

SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data. Use it to analyze your huge spatial datasets on a cluster of machines.

What is new

Version 2.1

  • Preliminary support for HDF files in MODIS datasets brought by NASA.
  • SpatialHadoop operations are now easier to access through a new bin/shadoop command
  • Several bug fixes in visualization operations

Version 2 beta 2

  • An all new PlotPyramid operation that plots a set of tiles ready to be visualized using an engine similar to Google Maps. Users can control the size of each tile as well as the number of zoom levels generated.

Version 2 beta 0

  • Version 2 of SpatialHadoop is a remodel of the system that is much more flexible and powerful. Most importantly, this version ships mainly as an extension which can be plugged into an existing Hadoop cluster. This means you can keep using your own Hadoop cluster and just extend it with spatial capabilities by installing SpatialHadoop on it. You still get the orders of magnitude performance improvement because SpatialHadoop classes work with the core Hadoop allowing it to efficiently process spatial data.
  • SpatialHadoop v2 also ships with many new features and improvements.
    • Basic data types now support double precision numbers for coordinates
    • Added support for OGC standard data types and operations such as Linestring and MultiPolygon
    • SpatialHadoop can now run in standalone mode making it easier than before to debug
    • New operations added including convex hull, farthest/closest pair and skyline
    • Several performance improvements and bug fixes for existing operations

Installing and configuring SpatialHadoop

SpatialHadoop is designed in a generic way which allows it to run on any configured Hadoop cluster. SpatialHadoop was tested on Apache Hadoop 1.2.1 but it should run seamlessly on other distributions of Hadoop as well. First, you need to setup traditional Hadoop. Once Hadoop is configured, you can install SpatialHadoop on that distribution which adds the new classes and configuration files to the cluster allowing the new commands to be used. This tutorial will take you through the process of setting up SpatialHadoop on a single machine and running some examples. It can be easily expanded to a cluser of machines by following this tutorial.

Prerequisites

  • This tutorial sets up SpatialHadoop on Linux. If you want to run SpatialHadoop on Windows, you can try Hortonworks.
  • JavaTM 1.6.x, preferably from Sun, must be installed. We recommend using Java 1.6.0_24. ?

    To check the version of Java installed type

    $ java -version

    To install java type

    $ sudo apt-get install openjdk-6-jre-headless

Download

Download the latest version of SpatialHadoop here.

Configure on one machine

  • Extract the downloaded compressed file into a local directory
  • Edit conf/hadoop-env.sh in the extracted folder and set JAVA_HOME to your current Java installation. ?

    The default location in Ubuntu installations is:

    /usr/lib/jvm/java-6-openjdk/jre
    -- For 32-bit version
    /usr/lib/jvm/java-6-openjdk-amd64/jre
    -- For 64-bit version
    You can lookup your java home directory by typing
    $ which java

    To set JAVA_HOME, find a commented line that looks like

    # export JAVA_HOME=<something>

    Uncomment this line, and replace <something> with your Java home path.

Usage examples

Once you have SpatialHadoop configured correctly, you are ready to run some sample programs. The following steps will generate a random file, index it using a Grid index, and run some spatial queries on the indexed file. The classes needed for this example are all contained in the spatialhadoop*.jar shipped with the binary release. You can type 'bin/shadoop' to get the usage syntax for the available operations.

To generate a random file containing random rectangles, enter the following command

$ bin/shadoop generate test mbr:0,0,1000000,1000000 size:1.gb shape:rect

This generates a 1GB file named 'test', where all rectangles in the file are contained in the rectangle with corner at (0,0) and dimensions 1Mx1M units.


If you have your own file that needs to be processed, you can upload it the same way you do with traditional Hadoop by typing

$ bin/hadoop fs -copyFromLocal <local file path> <HDFS file path>

Then you can index this file using the following command


To index this file using a Grid index

$ bin/shadoop index test test.grid mbr:0,0,1000000,1000000 sindex:grid

To see how the grid index partitions this file, type:

$ bin/shadoop readfile test.grid

This shows the list of partitions in file, each defined by boundaries, along with the number of blocks in each partition.


To run a range query operation on this file

$ bin/shadoop rangequery test.grid rq_results rect:500,500,1000,1000

This runs a range query over this file with the query range set to the rectangle at (500,500) with dimensions 1000x1000. The results will be stored in an HDFS file named 'rq_result'


To run a knn query operation on this file

$ bin/shadoop knn test.grid knn_results point:1000,1000 k:1000

This runs a knn query where the query point is at (1000,1000) and k=1000. The results are stored in HDFS file 'knn_results'


To run a spatial join operation

First, generate another file and have it indexed on the fly using the command

$ bin/shadoop generate test2.grid mbr:0,0,1000000,1000000 size:100.mb sindex:grid

Now, join the two files via the Distributed Join algorithm using the command

$ bin/shadoop dj test.grid test2.grid sj_results

SpatialHadoop virtual machine

For your convenience, SpatialHadoop is also available as a virtual machine image. This virtual machine has the latest version of SpatialHadoop already installed and configured. Just run it and you are ready to go. This machine was created and tested on VirtualBox version-4.3.6 available for free download.

  • Download the latest version of VirtualBox here
  • Download the latest virtual machine image
  • Start VirtualBox, choose from the drop down menus 'File'->'Import Appliance', and browse to the downloaded machine image.
  • Start the virtual machine by clicking the 'Start' button. The username/password combination is 'shadoop/shadoop'.
  • SpatialHadoop is located in '~/spatialhadoop-*' and configured to run in a pseudo-distributed mode.

All downloads

SpatialHadoop v2 is available in several formats. If you need to run it in release on your existing cluster without losing your current setup, you can download the binary version which ships as an extension. If you don't have a Hadoop cluster already running, you can download the version shipped with Apache Hadoop which can be setup exactly like you setup a traditional Hadoop cluster. If you just want to play around with SpatialHadoop and test its features, you probably want to download the virtual machine image (coming soon) which is already configured and ready to use. There is also a development virtual machine (coming soon) that you can use to develop your own programs in SpatialHadoop. Finally, there is an Amazon EC2 Image (coming soon) which can be used to quickly setup a SpatialHadoop cluster in the cloud.

Find here the list of all available downloads for SpatialHadoop Source code

SpatialHadoop v2.1

SpatialHadoop v2 Beta 2

Previous releases

The previous releases of SpatialHadoop are still available for your convenience.

Contact us

If you have any comments or problems, drop us a message at:

SpatialHadoop Team
Ahmed Eldawy
Mohamed Mokbel

Latest News

Feb-23-2014
An updated virtual machine image is available

Feb-20-2014
A tutorial is released to set up SpatialHadoop on Amazon EC2

Dec-22-2013
TIGER real datasets released and are now available for download.

More news»

SpatialHadoop on github

SpatialHadoop is open source and hosted on github. You're welcome to download the source code, check how basic operations are implemented and implement your own operations.

Check SpatialHadoop on github»