Setting up SpatialHadoop on Amazon EC2 (Works only for Hadoop 1.x)
This tutorial describes how to set up a cluster on Amazon EC2 that runs SpatialHadoop. The process is very similar to installing Hadoop on EC2, with one extra step that installs SpatialHadoop.
- The first step is to download and expand the latest SpatialHadoop binary folder on your local machine.
- Edit the file '<spatialhadoop>/src/contrib/ec2/bin/hadoop-ec2-env.sh'. Set the values of 'AWS_ACCOUNT_ID', 'AWS_ACCESS_KEY_ID' and 'AWS_SECRET_ACCESS_KEY' to the credentials of your Amazon EC2 account. This allows the script to access your Amazon account and start the instances there. For more details, check how to run Hadoop on Amazon EC2.
- In the same file, '<spatialhadoop>/src/contrib/ec2/bin/hadoop-ec2-env.sh', set HADOOP_VERSION to '1.2.1' and S3_BUCKET to '512500806257'. This bucket contains a recent Amazon image with Hadoop 1.2.1 installed, which is used as the base image for the cluster.
- Edit the file '<spatialhadoop>/src/contrib/ec2/bin/hadoop-ec2-init-remote.sh'. Add the line that starts with 'wget -qO-' right after the line that starts with 'HADOOP_HOME ...', so that this part of the file reads as follows.
if [ "$IS_MASTER" == "true" ]; then
MASTER_HOST=`wget -q -O - http://169.254.169.254/latest/meta-data/local-hostname`
fi
HADOOP_HOME=`ls -d /usr/local/hadoop-*`
wget -qO- http://spatialhadoop.cs.umn.edu/downloads/spatialhadoop-2.3.tar.gz | tar --directory $HADOOP_HOME -xvz
################################################################################
# Hadoop configuration
# Modify this section to customize your Hadoop cluster.
################################################################################
Note: You can replace 'spatialhadoop-2.3.tar.gz' with 'spatialhadoop-latest.tar.gz'. This will install a more recent version of SpatialHadoop which has some new features and bug fixes. However, it might not be as stable as the release version.
- Now your cluster is ready to start. You can launch a new cluster named 'test-cluster' with two slave nodes by typing:
bin/hadoop-ec2 launch-cluster test-cluster 2
For more details, check how to run Hadoop on Amazon EC2.
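The file edits described in the steps above can also be scripted. The following is only a minimal sketch and is not part of the SpatialHadoop distribution. The helper names 'set_env_var' and 'add_spatialhadoop_install' are made up for illustration, and the sketch assumes that each variable in 'hadoop-ec2-env.sh' is assigned on its own line in the form NAME=value.

```shell
#!/usr/bin/env bash
# Sketch: apply the tutorial's edits to the two EC2 scripts without an editor.
# Both helper names are hypothetical; they are not shipped with SpatialHadoop.

# set_env_var FILE NAME VALUE
# Rewrite the line that starts with "NAME=" so that it reads "NAME=VALUE".
# Assumption: the variable is assigned at the start of its own line.
# awk is used instead of sed so that "/" or "+" in a key needs no escaping.
# VALUE must not contain backslashes (awk -v interprets escape sequences).
set_env_var() {
  local file="$1" name="$2" value="$3" tmp status
  grep -q "^${name}=" "$file" || { echo "no ${name}= line in $file" >&2; return 1; }
  tmp="$(mktemp)" || return 1
  awk -v n="$name" -v v="$value" '
    index($0, n "=") == 1 { print n "=" v; next }   # matching assignment: replace it
    { print }                                       # every other line: keep as is
  ' "$file" > "$tmp" && cat "$tmp" > "$file"        # cat keeps the file permissions
  status=$?
  rm -f "$tmp"
  return "$status"
}

# add_spatialhadoop_install FILE URL
# Insert the SpatialHadoop download line right after the first line that starts
# with "HADOOP_HOME". Does nothing if URL already appears in FILE.
add_spatialhadoop_install() {
  local file="$1" url="$2" tmp status
  grep -qF "$url" "$file" && return 0
  tmp="$(mktemp)" || return 1
  awk -v line="wget -qO- ${url} | tar --directory \$HADOOP_HOME -xvz" '
    { print }
    !inserted && index($0, "HADOOP_HOME") == 1 { print line; inserted = 1 }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Example (run from the expanded SpatialHadoop folder, with your own credentials
# already exported as environment variables):
#   EC2_BIN=src/contrib/ec2/bin
#   set_env_var "$EC2_BIN/hadoop-ec2-env.sh" AWS_ACCOUNT_ID "$AWS_ACCOUNT_ID"
#   set_env_var "$EC2_BIN/hadoop-ec2-env.sh" AWS_ACCESS_KEY_ID "$AWS_ACCESS_KEY_ID"
#   set_env_var "$EC2_BIN/hadoop-ec2-env.sh" AWS_SECRET_ACCESS_KEY "$AWS_SECRET_ACCESS_KEY"
#   set_env_var "$EC2_BIN/hadoop-ec2-env.sh" HADOOP_VERSION 1.2.1
#   set_env_var "$EC2_BIN/hadoop-ec2-env.sh" S3_BUCKET 512500806257
#   add_spatialhadoop_install "$EC2_BIN/hadoop-ec2-init-remote.sh" \
#     http://spatialhadoop.cs.umn.edu/downloads/spatialhadoop-2.3.tar.gz
```

Running the helpers a second time leaves the files unchanged, so the sketch can be re-applied safely after downloading a fresh copy of the scripts. It is still worth opening both files afterwards to confirm that the edits landed where the steps above describe.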
Using SpatialHadoop with Amazon Elastic MapReduce (EMR) (Works for both Hadoop 1.x and 2.x)
Amazon provides an alternative way to run MapReduce jobs through the Elastic MapReduce (EMR) service. The service takes over the burden of configuring and starting the Hadoop cluster, which you control through a simple web console or a command line interface. SpatialHadoop can run on EMR clusters by providing a bootstrap action that installs SpatialHadoop as the cluster is starting.
In this tutorial, we will show how to install SpatialHadoop using the web console, but the same technique can be used in the command line interface.
- Start the "New Cluster" wizard by clicking the "Create Cluster" button in the web console.
- Choose the Hadoop distribution and version you want to run. In this tutorial, we will use Amazon's distribution of Hadoop, which builds on Apache Hadoop 2.4.0. You can also choose an older version, but Amazon does not recommend it. We have not tested SpatialHadoop with the MapR distribution, so choosing it is at your own discretion.
- In the "Bootstrap Actions" section, add a new bootstrap action, choose "Custom action" and click "Configure and add".
- In the name field enter "Install SpatialHadoop", in the S3 location enter "s3://shadoop-emr/install-shadoop.rb" and leave the "Optional arguments" field blank. When you are done, click "Add".
Hint: Leaving the "Optional arguments" field blank automatically installs the most recent version of SpatialHadoop. If you would like to install a specific version, enter the download URL of the SpatialHadoop package as an argument. For example, if you would like to install SpatialHadoop 2.2, enter "http://spatialhadoop.cs.umn.edu/downloads/spatialhadoop-2.2.tar.gz" as an argument.
- You can just start the cluster without specifying any steps and it will have SpatialHadoop installed on it. If you would also like to run some steps, choose the "Custom JAR" step and click "Configure and add".
Enter a suitable name for the step and specify the JAR location as "/home/hadoop/spatialhadoop-main.jar". In the "arguments" field, specify the command you would like to run along with any arguments as shown in the figure below.
- Finally, you can start the cluster using the "Create cluster" button at the bottom of the page.
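The same setup can be expressed through the command line interface. The following is a rough sketch that assumes the AWS CLI ('aws emr create-cluster') is installed and configured. The function name 'create_spatialhadoop_cluster', the cluster name, the instance type and count, and the AMI version '3.1.0' (assumed here to correspond to Hadoop 2.4.0) are all illustrative placeholders rather than values taken from this tutorial. Only the bootstrap script location, the action name and the JAR path come from the steps above.

```shell
#!/usr/bin/env bash
# Sketch: start an EMR cluster with the SpatialHadoop bootstrap action.
# create_spatialhadoop_cluster is a hypothetical helper, not an AWS or
# SpatialHadoop command.

# create_spatialhadoop_cluster [PACKAGE_URL] [STEP_ARGS]
#   PACKAGE_URL  optional download URL of a specific SpatialHadoop package;
#                when empty, the bootstrap action receives no arguments.
#   STEP_ARGS    optional comma-separated command and arguments for a
#                "Custom JAR" step; when empty, no step is added.
# Set AWS_CMD=echo to print the arguments instead of calling the AWS CLI.
create_spatialhadoop_cluster() {
  local package_url="${1:-}" step_args="${2:-}" bootstrap
  bootstrap="Path=s3://shadoop-emr/install-shadoop.rb,Name=Install SpatialHadoop"
  if [ -n "$package_url" ]; then
    bootstrap="${bootstrap},Args=[${package_url}]"
  fi

  # Build the argument list in the positional parameters so that values
  # containing spaces stay intact. All defaults below are placeholders.
  set -- emr create-cluster \
    --name "${CLUSTER_NAME:-SpatialHadoop cluster}" \
    --ami-version "${AMI_VERSION:-3.1.0}" \
    --instance-type "${INSTANCE_TYPE:-m3.xlarge}" \
    --instance-count "${INSTANCE_COUNT:-3}" \
    --use-default-roles \
    --bootstrap-actions "$bootstrap"

  if [ -n "$step_args" ]; then
    set -- "$@" --steps \
      "Type=CUSTOM_JAR,Name=Run SpatialHadoop,Jar=/home/hadoop/spatialhadoop-main.jar,Args=[${step_args}]"
  fi

  "${AWS_CMD:-aws}" "$@"
}

# Example: print the call for the latest SpatialHadoop version without running it.
#   AWS_CMD=echo create_spatialhadoop_cluster
# Example: install SpatialHadoop 2.2 instead of the latest version.
#   create_spatialhadoop_cluster \
#     http://spatialhadoop.cs.umn.edu/downloads/spatialhadoop-2.2.tar.gz
```

Printing the call with AWS_CMD=echo first is a cheap way to review the arguments before any instances are started and billed.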