Compile SpatialHadoop from Source

To compile SpatialHadoop from source, you need to obtain the source code, install the required development tools, and then issue a compile command.

Required Tools

Compilation Steps

  1. Obtain source code: You can use git to get the source code from GitHub using the command
    git clone

    If you do not wish to use git, you can download the source code as an archive from the project page on GitHub.
  2. Compile: Navigate to the source code directory and issue the command

    This will compile the source code according to Hadoop 1.2.1. If you want to compile it with the libraries of Hadoop 2.x, issue the command
    ant compile2

  3. Generate a runnable jar: To build a runnable jar file that contains the classes of SpatialHadoop and can be run using the hadoop jar command, issue the command
    ant dist1

    Similarly, to generate a runnable jar for Hadoop 2.x, use the command:
    ant dist2

    Note that the generated jar contains only the classes of SpatialHadoop, without any third-party libraries (e.g., JTS). This means you cannot run this jar unless your Hadoop distribution already has all the required libraries installed.
  4. Generate a portable runnable jar: To create a portable runnable jar that will run on Apache Hadoop 1.x or a compatible version, use the command:
    ant emr-jar1

    To generate a portable runnable jar for Hadoop 2.x, use the command:
    ant emr-jar2
  5. Create a distribution package: If you want to create a redistribution package which can be installed on any Hadoop version, use the command
    ant package1

    Similarly, for Hadoop 2.x, use the command:
    ant package2

    This will generate a .tar.gz package that contains all required libraries and files in a directory hierarchy similar to that of Apache Hadoop. You need to extract the package into the Hadoop home directory of every cluster node and then restart the cluster so that all nodes load the new libraries.
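The target names in steps 3 to 5 follow one pattern: the artifact kind (dist, emr-jar, or package) followed by the Hadoop major version (1 or 2). As a minimal sketch of that pattern, the shell function below builds a target name from those two parts. The function name spatialhadoop_target is hypothetical and is not part of SpatialHadoop or its build file. The compile step is left out because only its Hadoop 2.x form (ant compile2) is given above.

```shell
# Hypothetical helper, not part of SpatialHadoop: prints the ant target name
# for an artifact kind ($1) and a Hadoop major version ($2).
# Only the target names listed in steps 3 to 5 are accepted.
spatialhadoop_target() {
  case "$1" in
    dist|emr-jar|package) ;;
    *) echo "unknown artifact kind: $1" >&2; return 1 ;;
  esac
  case "$2" in
    1|2) ;;
    *) echo "unsupported Hadoop major version: $2" >&2; return 1 ;;
  esac
  printf '%s%s\n' "$1" "$2"
}
```

For example, running ant "$(spatialhadoop_target dist 2)" from the source code directory would invoke ant dist2.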

Comparison of the installation techniques

We described three ways to build the binaries of SpatialHadoop and run them. There is a tradeoff between performance and portability among the three techniques as described below.

The distribution package technique is the most efficient because it injects all SpatialHadoop classes and required libraries into the Hadoop distribution, so they are loaded at startup on every Hadoop node. When you run any SpatialHadoop command, it is served directly from the classes in memory without loading any classes from disk. The drawback is that whenever you change the classes of SpatialHadoop, you need to reinstall the new libraries on every Hadoop node and restart the cluster before you can use them.
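On a single node, reinstalling amounts to extracting the generated .tar.gz into that node's Hadoop home directory. The sketch below shows that one step under stated assumptions: the function name install_spatialhadoop_package is hypothetical, both paths are supplied by the caller, and the archive is assumed to have no extra top-level directory, so its contents merge directly into the Hadoop home. Copying the archive to each node and restarting the cluster depend on your setup and are not shown.

```shell
# Hypothetical helper, not part of SpatialHadoop.
# $1: path to the .tar.gz produced by the package target (file name is up to you)
# $2: Hadoop home directory on this node
install_spatialhadoop_package() {
  if [ ! -f "$1" ]; then
    echo "package not found: $1" >&2
    return 1
  fi
  if [ ! -d "$2" ]; then
    echo "Hadoop home not found: $2" >&2
    return 1
  fi
  # Extract into the Hadoop home so the packaged hierarchy merges with it.
  tar -xzf "$1" -C "$2"
}
```

You would run this once per node, for example over ssh, and then restart the cluster.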

The portable runnable jar is the other extreme. It creates one runnable jar file that contains SpatialHadoop classes in addition to all required libraries. This jar file can be executed using the hadoop jar command on any Hadoop distribution. This means Hadoop has to distribute the jar file to all cluster nodes before running every job. In addition, each machine has to load all classes from the jar file on each run. Although this adds some overhead, it has the advantage of running on any cluster even if you do not have administrator access to it, because you do not need to restart the cluster or add any files to the Hadoop home directory.

The runnable jar strikes a balance between the other two techniques. In this case, you install only the third-party libraries in your Hadoop distribution and then restart the cluster to have these libraries loaded. SpatialHadoop classes, however, are not installed as part of the cluster. This makes it easy to modify the code of SpatialHadoop, recompile it, and run it without restarting the cluster. This technique still requires administrative access to the cluster to install third-party libraries that are not part of the default Hadoop distribution.
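The tradeoff can be summarized as a small decision rule. The sketch below is one possible reading of the comparison, not an official recommendation: the function name choose_technique and its two yes/no inputs are hypothetical, and it prints the target family (emr-jar, dist, or package) that matches the situation.

```shell
# Hypothetical helper, not part of SpatialHadoop.
# $1: "yes" if you have administrator access to the cluster
# $2: "yes" if you expect to modify and recompile SpatialHadoop often
choose_technique() {
  if [ "$1" != "yes" ]; then
    # No admin access: only the portable jar needs no install and no restart.
    echo "emr-jar"
  elif [ "$2" = "yes" ]; then
    # Admin access and frequent code changes: install third-party libraries
    # once, then rebuild the runnable jar without restarting the cluster.
    echo "dist"
  else
    # Admin access and stable code: preload everything for best performance.
    echo "package"
  fi
}
```

Appending the Hadoop major version to the printed name gives the ant target, for example dist2 for Hadoop 2.x.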