• SpatialHadoop

Analyze your spatial data efficiently

SpatialHadoop is an open source MapReduce framework designed specifically to handle huge datasets of spatial data. SpatialHadoop is shipped with built-in spatial high level language, spatial data types, spatial indexes and efficient spatial operations.



SpatialHadoop Overview

SpatialHadoop is a MapReduce framework designed specially to work with spatial data. Use it to analyze your huge spatial datasets on a cluster of machines.

Installing and configuring SpatialHadoop

SpatialHadoop is built on Hadoop 1.0.3 and can be installed just the same way you do with the traditional Hadoop. This tutorial will take you through the process of setting up SpatialHadoop on a single machine and running some examples. It can be easily expanded to a cluser of machines by following this tutorial.

Prerequisites

  • SpatialHadoop is supported mainly under Linux. Windows is partially supported for development but not for production.
  • JavaTM 1.6.x, preferably from Sun, must be installed. We recommend using Java 1.6.0_24. ?

    To check the version of Java installed type
    $ java -version

    To install java type
    $ sudo apt-get install openjdk-6-jre-headless

  • SSH client and SSH server. ?

    To check if both SSH client and server are installed correctly type
    $ ssh localhost echo success
    You should see the message 'success' on screen

    To install both SSH client and server type
    $ sudo apt-get install openssh-client openssh-server

    To configure SSH to login to your local host with no password type
    $ mkdir ~/.ssh && touch ~/.ssh/authorized_key
    $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

    To allow SSH to login to your localhost without prompt type
    $ ssh localhost echo success
    If prompted, accept the fingerprint by typing "yes" and hiting "ENTER".

  • For Windows: Cygwin is Required for shell support in addition to the required software above.

Download

Download the latest version of SpatialHadoop here.

Configure on one machine

  • Extract the downloaded compressed file into a local directory
  • Edit conf/hadoop-env.sh in the extracted folder and set JAVA_HOME to your current Java installation. ?

    The default location in Ubuntu installations is:
    /usr/lib/jvm/java-6-openjdk/jre -- For 32-bit version
    /usr/lib/jvm/java-6-openjdk-amd64/jre -- For 64-bit version
    You can lookup your java home directory by typing
    $ which java

    To set JAVA_HOME, find a commented line that looks like
    # export JAVA_HOME=<something>
    Uncomment this line, and replace <something> with your Java home path.

  • Configure SpatialHadoop by using the following configuration files
    conf/core-site.xml:
    <configuration>
         <property>
             <name>fs.default.name</name>
             <value>hdfs://localhost:9000</value>
         </property>
    </configuration>
    
    conf/hdfs-site.xml:
    <configuration>
         <property>
             <name>dfs.replication</name>
             <value>1</value>
         </property>
    </configuration>
    
    conf/mapred-site.xml:
    <configuration>
         <property>
             <name>mapred.job.tracker</name>
             <value>localhost:9001</value>
         </property>
    </configuration>
    
  • Format Hadoop Distributed File System (HDFS) by typing
    $ bin/hadoop namenode -format
  • Start a local SpatialHadoop cluster by typing
    $ bin/start-all.sh

Usage examples

Once you have SpatialHadoop configured correctly, you are ready to run some sample programs. The following steps will generate a random file, index it using a Grid index, and run some spatial queries on the indexed file. The classes needed for this example are all contained in the hadoop-operations-*.jar shipped with the binary release. You can type 'bin/hadoop jar hadoop-operations-*.jar' to get the usage syntax for the available operations.

To generate a random file containing random rectangles, enter the following command

$ bin/hadoop jar hadoop-operations-*.jar generate test mbr:0,0,1000000,1000000 size:1.gb shape:rect

This generates a 1GB file named 'test', where all rectangles in the file are contained in the rectangle with corner at (0,0) and dimensions 1Mx1M units.


If you have your own file that needs to be processed, you can upload it the same way you do with traditional Hadoop by typing

$ bin/hadoop fs -copyFromLocal <local file path> <HDFS file path>

Then you can index this file using the following command


To index this file using a global-only Grid index

$ bin/hadoop jar hadoop-operations-*.jar index test test.grid mbr:0,0,1000000,1000000 global:grid

To see how the grid index partitions this file, type:

$ bin/hadoop jar hadoop-operations-*.jar readfile test.grid

This shows the list of partitions in file, each defined by boundaries, along with the number of blocks in each partition.


To run a range query operation on this file

$ bin/hadoop jar hadoop-operations-*.jar rangequery test.grid rq_results rect:500,500,1000,1000

This runs a range query over this file with the query range set to the rectangle at (500,500) with dimensions 1000x1000. The results will be stored in an HDFS file named 'rq_result'


To run a knn query operation on this file

$ bin/hadoop jar hadoop-operations-*.jar knn test.grid knn_results point:1000,1000 k:1000

This runs a knn query where the query point is at (1000,1000) and k=1000. The results are stored in HDFS file 'knn_results'


To run a spatial join operation

First, generate another file and have it globally indexed on the fly using the command

$ bin/hadoop jar hadoop-operations-*.jar generate test2.grid mbr:0,0,1000000,1000000 size:100.mb global:grid

Now, join the two files via the Distributed Join algorithm using the command

$ bin/hadoop jar hadoop-operations-*.jar dj test.grid test2.grid sj_results

SpatialHadoop virtual machine

For your convenince, SpatialHadoop is also available as a virtual machine image. This virtual machine has the latest version of SpatialHadoop already installed and configured. Just run it and you are ready to go. This machine was created and tested on VirtualBox version-4.2.6 available for free download.

  • Download the latest version of VirtualBox here
  • Download the latest virtual machine image here and extract it
  • Start VirtualBox, choose from the drop down menus 'Machine'->'Add Machine', and browse to the extracted machine image.
  • Start the virtual machine by clicking the 'Start' button. The username/password are 'shadoop/shadoop'.
  • SpatialHadoop is located in '~/hadoop-*' and configured to run in a single-machine setup.

All downloads

Contact us

If you have any comments or problems, drop us a message at:

SpatialHadoop Team
Ahmed Eldawy
Mohamed Mokbel

Latest News

June-2-2013
SpatialHadoop is presented this VLDB 2013 in Trento, Italy (Aug 26-30).

More news»

SpatialHadoop on github

SpatialHadoop is open source and hosted on github. You're welcome to download the source code, check how basic operations are implemented and implement your own operations.

Check SpatialHadoop on github»