Update: Instructions updated for hadoop 0.20.2.
This is the first post of a series of small hadoop tutorials introducing progressively core hadoop functionnalities. You might be interested in that series if you recognized yourself in one or more of the following points :
- You’ve heard about the basics of MapReduce (else check the links that I recommend at the end of this post)
- Even if you’re not working at Google, Yahoo, Facebook (or many others) for now, you know it’s been years that MapReduce/Hadoop has become a must-have skill and that you should practice it but you have very few time
- You did try to read some tutorials but it was always either not hands-on enough or too much detailed
This first post is dedicated to build what I called a “MapReduce Learning Playground”: for practice or for a real need, you read or wrote on a sheet of paper the map and reduce functions that might solve a particular problem and you want to see it in action, not necessarily on huge data sets, just check that it computes the correct answer.
A lot of material can be found on the internet to do the same. The steps below are my attempt to present the best part of all the training material that I read on the subject, adapt it, adding it some glue (here with maven) and compile the whole into something that I hope will save you some time.
Step 1: Install the cloudera training virtual machine
Cloudera is really doing a great job at providing training material for hadoop. The most useful one in my opinion is their hadoop training virtual machine. Update: they changed a lot of things since that post was written, here is a better link for a training virtual machine (Thanks Karthick for letting me know that the old link was broken ). It provides a VMWare image of a Linux Ubuntu distribution with a pre-installed hadoop cluster in Pseudo-Distributed Mode.
To install the VM on your computer, just follow their instructions, it is free (except if you’re on Mac) and very easy. The VM comes also with hadoop related tools already installed like hive and pig (it will probably be the subject of other posts).
At the end of the installation, open the VMWare Player, start the cloudera VM (with training/training as user/pass) and you should get something like this:
Step 2: Creating an “hadoop ready” project with maven
Cloudera does provide some training projects already mounted in the eclipse installed in the VM but those projects contains several small errors (like missing dependencies). Even if those are very easily fixed, I describe here how to build your own project from scratch; it will give you a better basis in case you want to extend them and you’ll always be able to copy-paste the map-reduce functions of some interesting cloudera training projects into your own ones.
First install maven on the VM. Open a terminal from the VM and type:
sudo apt-get install maven2
If for some reasons you run into trouble with the installation of maven, you can always download it directly from here. Assuming you are unzipping it into the /user/local/apache-maven directory, you can add those lines into your /home/training/.bashrc configuration file:
export M2_HOME=/usr/local/apache-maven export M2=$M2_HOME/bin export PATH=$M2:$PATH
Then from the same terminal go into the workspace directory (usually located at ~/workspace) and create a java project hierarchy using the following maven command (change the groupId and the artifactId as you like):
mvn archetype:create -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=com.philippeadjiman.hadooptraining -DartifactId=hadoop-first-example
Then enter into the hadoop-first-example directory and generate the necessary files for eclipse:
Then open eclipse from the VM then File -> Import -> Existing Projects into Workspace -> Browse, choose the hadoop-first-example directory, OK -> Finish. Then you should see your project on the left side.
You may have an error on a M2_REPO unresolved variable, that’s OK, it’s because it is the first time that this eclipse use a maven project. To fix it, just right click on your project -> Build Path -> Configure Build Path -> Add Variable -> Configure Variables -> New. In the name type M2_REPO and in the path type /home/training/.m2/repository (just check that this directory exists).
Then you’ll have to add the hadoop jar dependency. To do so, you just have to open you pom.xml file (you’ll see it at the bottom of your project) and the following dependency (add it just before the </dependencies> closing tag ):
You can check here if there is a newer version. Note also that if you plan to use hadoop to run with a specific framework built on top of it then make sure you’re using the right version. E.g. for mahout, use the version you’ll find here.
Then you can go back to the terminal in your hadoop-first-example directory and type again mvn eclipse:eclipse to regenerate the eclipse files with now the hadoop dependency. You can now refresh your directory. You have now an “hadoop-ready” project.
Step 3: Put your map reduce program into your project and prepare the data on HDFS
The first time you heard about MapReduce, there is a good chance that you also heard about the word count example. The wordCount code on hadoop website is quiet outdated for hadoop v0.20, here is a link to a blog post with a more updated version of the word count code that will work with the 0.20.2 hadoop version used in that tutorial (Thanks Yi!). Be sure to put that code into the src directory of your project into a package called com.philippeadjiman.hadooptraining (or whatever else, as long as it matches to your package declaration).
Then before to deploy the job, you’ll have to count some words. Following a moby dick tradition in this blog, let’s download the full english raw text of moby dick that you can find here. If you want to run the hadoop job on it, you’ll have to put this file on HDFS, the underlying hadoop file system (check the “useful links” section below if you’ve never heard about HDFS).
Navigating, reading from and writing to HDFS is super simple and if you’re already familiar with regular unix file system commands then you’ll get it instantly: almost all the commands are the same, you just have to pass them to the wrapper command hadoop with ‘fs’ as argument and a ‘-’ before the command. Examples (to run from a regular terminal):
hadoop fs -help # will print all the command that you can execute on the HDFS hadoop fs -ls # will perform an ls from the HDFS home directory (set to /user/training in the VM). hadoop fs -mkdir input # will create the directory 'input' in the HDFS home directory (check if it does not already exists) hadoop fs -mkdir output # will create the directory 'input' in the HDFS home directory (check if it does not already exists) hadoop fs -put mobyDick.txt input # will put your local copy of mobyDick into the directory 'input' on the HDFS
Step 4: package your job, run it, observe the result
To launch your job on the hadoop infrastructure, you’ll have to package it into a jar file. With maven, nothing is more simple. Just go into your hadoop-first-example directory and type:
This will generate a jar into a sub-directory called target. Your jar will have a name like hadoop-first-example-1.0-SNAPSHOT.jar (you can change that generated name by editing the jar section of your pom). You can check that your jar file contains as expected the WordCount.class (and its inner classes) by typing:
jar -tf hadoop-first-example-1.0-SNAPSHOT.jar
You can now launch your hadoop job by executing the following command (adapt it with the correct package name if necessary):
hadoop jar hadoop-first-example-1.0-SNAPSHOT.jar com.philippeadjiman.hadooptraining.WordCount /user/training/input /user/training/output
Cloudera also comes with an handy web interface allowing to, among other things, monitor the jobs running on the cluster. Just open firefox and go to the url http://localhost:8088/. From the menu at the top right side of the page, choose job browser and you should see the status of your job. You can also click on it to see the details (status and number of mapper/reducer of your job):
You can also see your job output from there but I found it easier to check it directly from HDFS:
hadoop fs -ls output # should now contains something (a file named part-00000 should be your output) hadoop fs -cat output/part-00000 | less # will let you browse easily your output
Important: if you want to run your job another time, you’ll have to first erase all the current files of the output:
hadoop fs -rmr output # erase all the files contained in the output directory
You’ll notice that the output file contains many words with the ” character and other similar noise. This is of course because a simple StringTokenizer is used in the map function. To parse the text correctly, consider using some standard analyzers from Lucene for instance (you have an example in the code of step 2 of this post).
Step 5: Modify, Customize, Play, Learn. Now start the real fun…
Now that you built and deployed you own project from scratch, you have all you needs to modify certain parts of the code, create new map/reduce programs, test methods from the hadoop api, observe the results.
You can for instance try to see what happens if you use more than one reducer in the word count job (using conf.setNumReduceTasks(2) in the main method). Which lines are sent to which reducer? How to control that? How to sort the output of word count by number of occurrence (highest number first)?
Also I recommend to go over this tutorial that shows how to build an inverted index (see the theory here) using map/reduce (the tutorial contains many broken references w.r.t. the cloudera VM but you don’t care since now you know what you’re doing )
Future posts of this series will leverage the playground we built here to illustrate and learn about other interesting stuff around hadoop.
- Around the MapReduce concept: of course you have the classics like the original google paper or the map reduce tutorial of the hadoop website or the cloudera training sessions. But I would also recommend more funky introductions like the example described in this section of the excellent introduction to information retrieval (a must read book for search engineers) or this very nice post by Joel Spolsky.
- Around HDFS: description of the HDFS architecture from the hadoop website. Since HDFS design is inspired by the Google file system, it may worth to read about it here (original Google paper) or here (a nice vulgarization of the last one).
- Related projects: Mahout (I described here my experience using Mahout Taste for a startup project), Pig (and a nice tutorial by Yahoo on the cloudera website), HBase (an open source implementation of Google BigTable), and many others that you can find by starting from the hadoop website.