In this tutorial we will so you how you can quickly setup a Hadoop cluster using stratus-run-cluster command and a pre-cooked image from the StratusLab marketplace, in order to run MapReduce applications. MapReduce is a patented software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers. Apache Hadoop is an open source implementation of MapReduce.
Start by preparing a cluster of say 3 dual core nodes using stratus-run-cluster command:
stratus-run-cluster -n 3 --cluster-user=hadoop --ssh-hostbased --shared-folder=/home -t m1.xlarge CcMFFcnWsQaFpq__Bk_NmP73faX
The above will instantiate 3 machines, create a user hadoop and share the /home folder over NFS. The above command uses a pre-cooked appliance distributed from the StratusLab marketplace. This appliance uses Fedora 14 as base OS and has been prepared with the current stable version of Hadoop (0.20.203) and JDK 1.6. The image has been configured to provide a single master multi-node installation of Hadoop following the instructions from this tutorial by Michael G. Noll.
Once the instantiation and configuration of the cluster node finishes login to the master node switch to hadoop user and copy the file /tmp/cluster_nodelist to ~/hadoop_slaves.
su - hadoop export PATH=/opt/hadoop-0.20.203.0/bin:$PATH cp /tmp/cluster_nodelist ~/hadoop_slaves
That’s it! You are ready to run MapReduce applications.
Note that if you don’t wish the master to act as a worker node remove its IP from the above file (typically will be the first node appearing).
Hadoop provides a web monitoring application that helps you track the status of your cluster, the running jobs, storage space etc. The monitor is available from http://master:50030 where master is the hostname/ip of the VM acting as master node for the cluster.
In this example we follow the above mentioned Hadoop tutorial to run the word count application that ships together with Hadoop, on a large set of text files.
Format the namenode and create an HDFS file system:
hadoop namenode -format
Download a set of books from Project Gutenberg and place them as plain-text under /tmp/gutenberg
mkdir /tmp/gutenberg cd /tmp/gutenberg/ wget http://www.gutenberg.org/ebooks/20417.txt.utf8 wget http://www.gutenberg.org/ebooks/5000.txt.utf8 wget http://www.gutenberg.org/ebooks/4300.txt.utf8 wget http://www.gutenberg.org/ebooks/132.txt.utf8 wget http://www.gutenberg.org/ebooks/1661.txt.utf8 wget http://www.gutenberg.org/ebooks/972.txt.utf8 wget http://www.gutenberg.org/ebooks/19699.txt.utf8
Start the servers:
start-dfs.sh start-mapred.sh
Copy the files to HDFS:
hadoop dfs -copyFromLocal /tmp/gutenberg /user/hadoop/gutenberg
Run the sample word count application:
hadoop jar /opt/hadoop-0.20.203.0/hadoop-examples-0.20.203.0.jar wordcount /user/hadoop/gutenberg /user/hadoop/gutenberg-output
Once the job is finished check the results
hadoop dfs -ls /user/hadoop/gutenberg-output hadoop dfs -cat /user/hadoop/gutenberg-output/part-r-00000 | more
As mentioned the above setup follows a single-master installation where the master node of the cluster acts both as Namenode and Datanode. For performance reasons you may want to split the above tasks in two different nodes. In principle it should be easy to dedicate one additional node from the cluster and configure hadoop to use it as a second master. For this you will need to change the configuration files under /opt/hadoop-0.20.203.0/conf following the instructions from Hadoop’s home page.