A MapReduce framework for spatial data
To compile SpatialHadoop from source, obtain a copy of the source code, install the required development tools, and then issue a compile command.
git clone https://github.com/aseldawy/spatialhadoop2.git
ant
This will compile the source code against Hadoop 1.2.1. To compile it against the Hadoop 2.x libraries instead, issue the command
ant compile2
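The choice between the two compile commands can be sketched as a small shell helper. The function name spatialhadoop_compile_cmd is hypothetical and not part of SpatialHadoop, and treating every 1.x version like 1.2.1 is an assumption; only the mapping from Hadoop 1.2.1 and 2.x to the two ant commands comes from the text above.

```shell
# Hypothetical helper (not part of SpatialHadoop): print the ant command
# that matches a given Hadoop version string.
# Mapping from the text: plain `ant` builds against Hadoop 1.2.1,
# `ant compile2` builds against the Hadoop 2.x libraries.
# Assumption: any 1.x version is handled like 1.2.1.
spatialhadoop_compile_cmd() {
  case "$1" in
    1|1.*) echo "ant" ;;
    2|2.*) echo "ant compile2" ;;
    *)
      # Versions the text does not mention are rejected rather than guessed.
      echo "unsupported Hadoop version: $1" >&2
      return 1
      ;;
  esac
}
```

For example, calling spatialhadoop_compile_cmd 2.7.3 prints ant compile2, which you can then run from the root of the cloned source tree.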
To build a jar file that can be executed using the hadoop jar command, issue the command
ant dist1
ant dist2
ant emr-jar1
ant emr-jar2
ant package1
ant package2
We described three ways to build the binaries of SpatialHadoop and run them. There is a tradeoff between performance and portability among the three techniques as described below.
The distribution package technique is the most efficient, as it injects all SpatialHadoop classes and required libraries into the Hadoop distribution so that they are loaded at startup on every Hadoop node. This means that any SpatialHadoop command you run is served directly from the classes in memory, without loading any classes from disk. The drawback is that whenever you change the classes of SpatialHadoop, you need to reinstall the new libraries on every Hadoop node and restart the cluster before you can use them.
The portable runnable jar is the other extreme. It creates one runnable jar file that contains the SpatialHadoop classes in addition to all required libraries. This jar file can be executed using the hadoop jar command on any Hadoop distribution. This means Hadoop has to distribute the jar file to all cluster nodes before running every job. In addition, each machine has to load all classes from the jar file on each run. Although this adds some overhead, it has the advantage of running on any cluster, even one you do not have administrator access to, as you do not need to restart the cluster or add any files to the home directory.
The runnable jar balances the tradeoff between the two other techniques. In this case, you install only the third-party libraries in your Hadoop distribution and then restart the cluster to have these libraries loaded. The SpatialHadoop classes, however, are not installed as part of the cluster. This makes it easy to modify the code of SpatialHadoop, recompile it, and run it without restarting the cluster. This technique still requires administrative access to the cluster to install the third-party libraries, which are not part of the default Hadoop distribution.
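As a rough way to summarize the tradeoff, the shell sketch below suggests one of the three techniques from two yes/no answers. The function name choose_build_technique and its two parameters are hypothetical, and reducing the decision to administrator access and how often you change the code is a simplifying assumption based on the descriptions above.

```shell
# Hypothetical helper (not part of SpatialHadoop): suggest one of the three
# build techniques described in the text.
# $1: "yes" if you have administrator access to the cluster, otherwise "no"
# $2: "yes" if you expect to modify and recompile SpatialHadoop often, otherwise "no"
choose_build_technique() {
  admin_access="$1"
  frequent_changes="$2"
  if [ "$admin_access" != "yes" ]; then
    # Without administrator access you cannot install libraries or restart
    # the cluster, so everything must travel inside the jar.
    echo "portable runnable jar"
  elif [ "$frequent_changes" = "yes" ]; then
    # Third-party libraries are installed once; SpatialHadoop classes are
    # shipped with each job, so no restart is needed after a recompile.
    echo "runnable jar"
  else
    # All classes and libraries are loaded at startup on every node,
    # which gives the lowest per-job overhead.
    echo "distribution package"
  fi
}
```

For example, choose_build_technique yes no prints distribution package, while any call with no as the first argument prints portable runnable jar.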