Tutorial:Running MapReduce on StratusLab

In this tutorial we will so you how you can quickly setup a Hadoop cluster using stratus-run-cluster command and a pre-cooked image from the StratusLab marketplace, in order to run MapReduce applications. MapReduce is a patented software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. Apache Hadoop is an open source implementation of MapReduce.

Setup a Hadoop cluster

Start by preparing a cluster of say 3 dual core nodes using stratus-run-cluster command:

stratus-run-cluster -n 3 --cluster-user=hadoop --ssh-hostbased --shared-folder=/home -t m1.xlarge CcMFFcnWsQaFpq__Bk_NmP73faX

The above will instantiate 3 machines, create a user hadoop and share the /home folder over NFS. The above command uses a pre-cooked appliance distributed from the StratusLab marketplace. This appliance uses Fedora 14 as base OS and has been prepared with the current stable version of Hadoop (0.20.203) and JDK 1.6. The image has been configured to provide a single master multi-node installation of Hadoop following the instructions from this tutorial by Michael G. Noll.

Once the instantiation and configuration of the cluster node finishes login to the master node switch to hadoop user and copy the file /tmp/cluster_nodelist to ~/hadoop_slaves.

su - hadoop
export PATH=/opt/hadoop-0.20.203.0/bin:$PATH
cp /tmp/cluster_nodelist ~/hadoop_slaves

That’s it! You are ready to run MapReduce applications.

Note that if you don’t wish the master to act as a worker node remove its IP from the above file (typically will be the first node appearing).

Hadoop provides a web monitoring application that helps you track the status of your cluster, the running jobs, storage space etc. The monitor is available from http://master:50030 where master is the hostname/ip of the VM acting as master node for the cluster.

Run a sample MapReduce application

In this example we follow the above mentioned Hadoop tutorial to run the word count application that ships together with Hadoop, on a large set of text files.

Format the namenode and create an HDFS file system:

hadoop namenode -format

Download a set of books from Project Gutenberg and place them as plain-text under /tmp/gutenberg

mkdir /tmp/gutenberg
cd /tmp/gutenberg/
wget http://www.gutenberg.org/ebooks/20417.txt.utf8
wget http://www.gutenberg.org/ebooks/5000.txt.utf8
wget http://www.gutenberg.org/ebooks/4300.txt.utf8
wget http://www.gutenberg.org/ebooks/132.txt.utf8
wget http://www.gutenberg.org/ebooks/1661.txt.utf8
wget http://www.gutenberg.org/ebooks/972.txt.utf8
wget http://www.gutenberg.org/ebooks/19699.txt.utf8

Start the servers:

start-dfs.sh
start-mapred.sh

Copy the files to HDFS:

hadoop dfs -copyFromLocal /tmp/gutenberg /user/hadoop/gutenberg

Run the sample word count application:

hadoop jar /opt/hadoop-0.20.203.0/hadoop-examples-0.20.203.0.jar wordcount /user/hadoop/gutenberg /user/hadoop/gutenberg-output

Once the job is finished check the results

hadoop dfs -ls /user/hadoop/gutenberg-output
hadoop dfs -cat /user/hadoop/gutenberg-output/part-r-00000 | more

As mentioned the above setup follows a single-master installation where the master node of the cluster acts both as Namenode and Datanode. For performance reasons you may want to split the above tasks in two different nodes. In principle it should be easy to dedicate one additional node from the cluster and configure hadoop to use it as a second master. For this you will need to change the configuration files under /opt/hadoop-0.20.203.0/conf following the instructions from Hadoop’s home page.

  • Bookmark at
  • Bookmark "Tutorial:Running MapReduce on StratusLab" at del.icio.us
  • Bookmark "Tutorial:Running MapReduce on StratusLab" at Digg
  • Bookmark "Tutorial:Running MapReduce on StratusLab" at Reddit
  • Bookmark "Tutorial:Running MapReduce on StratusLab" at Google
  • Bookmark "Tutorial:Running MapReduce on StratusLab" at StumbleUpon
  • Bookmark "Tutorial:Running MapReduce on StratusLab" at Facebook
  • Bookmark "Tutorial:Running MapReduce on StratusLab" at Twitter