A Data Science Exploration From the Titanic in R

Illustration of the (very hype) random forest learning method (click to see original website)

Kaggle offered this year a knowledge competition called “Titanic: Machine Learning from Disaster” exposing a popular ”toy-yet-interesting” data set around the Titanic. The goal is to  predict as accurately as possible the survival of the titanic’s passengers based on their characteristics (age, sex, ticket fare etc…)

.

In that post, we’ll use that data set in order to:

  1. Illustrate through a comprehensive example a set of useful tools/packages to do some predictive modelling from the R statistical framework.
  2. Take the opportunity of the example to illustrate the process and kind of tricks that it takes to improve/tune a predictive model.

The whole code creating all the plots/stats and models exposed in that post and also building an output reaching a score 0.79426 on the leaderboard can be found on github here  or on Rpubs here (built with Knit HTML from R studio ).

Preliminaries

First, download the test and training set from the data page of the competition (here is a zip of the two small files in case the page from kaggle is removed in the future).

Once you loaded the dataset into a data frame, you can do some data analysis/explorations.  Even though that part is critical to start playing and feeling the data, I won’t go into details because there already were blog posts written about that, in particular that one is a very nice R version of the getting started with excel data exploration tutorial on Kaggle’s website.

However, i’ll just illustrate a nice simple and effective way of observing one important aspect of the data: missing values.

The Amelia R package is a toolbox around missing values, in particular for performing imputation of the missing data. Getting a visual and global insight about missing data in the test and train set is as simple as that:

library(Amelia)
#... code for loading test and train data in a data frame
missmap(rawdata, main = "Missingness Map Train")
missmap(test, main = "Missingness Map Test")

Missingness Maps (click for higher quality)

From those maps, you can immediately observe that only the age feature is badly suffering from missing data. Considering how small is the training set, you can hardly just ignore records having a missing age. We’ll see later in the post what kind of strategy we can use to deal with that issue.

Building/Tuning models with Caret

The caret package is a kind of toolbox for homogenising the many existing R packages for classification and regression and also provide out of the box a standard way to perform common tasks like model parameters tuning and more. Also, the author (Max Khun) did an amazing job at documenting the package in the vignettes (here or here for a longer but older version) and on the package dedicated website.

Here is a snippet of code where i successively train a random forest and a gradient boosting machine (GBM) using the same train function from caret.

forest.model1 <- train(survived ~ pclass + sex + title + sibsp +parch ,
                               data.train,
                               importance=TRUE)

fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10)

gbm.model2 <- train(survived ~ pclass + sex + title + sibsp +parch ,
                    data.train,
                    distribution = "gaussian",
                    method = "gbm",
                    trControl = fitControl,
                    verbose = FALSE)

We’ll discuss later the features used in the formula but note the fitControl parameter which is passed in the call for training the GBM. This parameter allows to completely define the way the model parameters will be tuned. In that example, the model parameters of the GBM (namely interaction.depth, n.trees and shrinkage, see output below) were compared using a repeated 10-fold cross validation with accuracy being the metric for comparison, but everything is tuneable for that purpose (you can even pass a grid of specific values to compare for each model parameter).

712 samples
 13 predictors
  2 classes: 'yes', 'no' 

No pre-processing
Resampling: Cross-Validation (10 fold, repeated 10 times) 

Summary of sample sizes: 642, 640, 642, 641, 640, 640, ... 

Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy  Kappa  Accuracy SD  Kappa SD
  1                  50       0.8       0.565  0.0436       0.0964
  1                  100      0.801     0.567  0.0436       0.0965
  1                  150      0.801     0.568  0.0434       0.096
  2                  50       0.795     0.548  0.0426       0.097
  2                  100      0.801     0.559  0.0437       0.0999
  2                  150      0.804     0.565  0.0435       0.1
  3                  50       0.805     0.568  0.0449       0.102
  3                  100      0.807     0.573  0.0464       0.106
  3                  150      0.809     0.576  0.0442       0.1     

Tuning parameter 'shrinkage' was held constant at a value of 0.1
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were interaction.depth = 3, n.trees = 150 and shrinkage = 0.1.

Also, you can easily visualize variable importance (you need to specify importance=TRUE in the train function, as we did, for having it):

Variable Importance (click for higher quality)

You can observe that the variable value with the most importance is the title Mr . The interesting part is that the feature “title” was not initially in the data set and was artificially created (we’ll detail a bit more about it later in the post). But overall, caret offers a very nice framework for easy models comparison and tuning with proper/uniform built-in cross-validation routines.

One thing though that is so true and said in perfect way in this must-watch killer talk: “Don’t get stuck in algorithm land! Focus on putting better data in the algorithm”. We’ll see an example illustrating that later in the post.

Pick the best threshold for your classifier using ROC curves

Most classifiers usually output the probability of an example belonging to a specific class (here ‘survived’ or ‘died’). When the only matter is to optimise accuracy (as it is usually the case in competitions), it is useful to pick the optimal threshold/cutoff for assigning one class or the other.

ROC curves can be used for that and also to assess the robustness of your model. If you’ve never heard about ROC curves, this article gives the basic intuition and that paper goes much more into details while still being crystal clear (i warmly recommend the later if you’re interested in the subject). For a standalone very clear example in R, this post is what you need (the code below is inspired from it).

The pROC package allows to easily analyse and display ROC curves. Here, we’re interested in the threshold corresponding to the top left corner of the curve maximising sensitivity and specificity.

#code inspired from http://mkseo.pe.kr/stats/?p=790
result.predicted.prob.model1 <- predict(forest.model1, data.test, type="prob")
result.roc.model1 <-  roc(data.test$survived, result.predicted.prob.model1$yes)
plot(result.roc.model1, print.thres="best", print.thres.best.method="closest.topleft")

result.coords.model1 <- coords(  result.roc.model1, "best", best.method="closest.topleft",
                          ret=c("threshold", "accuracy"))
result.coords.model1

Which will output both a graph:

ROC curve (click for higher quality)

and high level information about the curve, e.g. :

Call:
roc.default(response = data.test$survived, predictor = result.predicted.prob.model1$yes)

Data: result.predicted.prob.model1$yes in 78 controls (data.test$survived yes) > 65 cases (data.test$survived no).
Area under the curve: 0.931

Note in particular the Area under the curve (a.k.a AUC) data point which is sometimes used to assess the robustness/quality of your model, although it has been questioned a lot and often criticised to not be a precise/useful classification performance measure (a small discussion around it can be found here). In other words, you’re often better off relying on your K-fold cross validation measures to assess your out-of-sample performance (c.f. the previous section on caret).

Tweak and tricks

I’ve hinted earlier that the number of missing ages was too high and the training set too small to just ignore the records having a missing age. At least for me, any attempt to impute the missing ages (either in naive or more sophisticated ways) didn’t lead to any significant accuracy improvement on the 10-fold cross validation test.

Turns out that extracting the title (i.e. Mr or Mrs. etc…) in the Name attribute of the data set did lead to an improvement (from the competition’s forums, i saw that few people used that feature as well). Let’s have a look at the age distributions per extracted title in the training set (some rare occurrences of titles were aggregated into larger titles, e.g. “Capt”, “Col”, “Major”,”Sir”, “Don”,”Dr” were mapped to “Mr”):

Age distributions per Title (click for higher quality)

This somehow matches the intuition (though I didn’t know that in apparently old/traditional english, “Master” denotes a young/unmarried man). And it also makes sense intuitively that Title is a good proxy for the too many missing ages, allowing for totally ignoring the age feature and thus keep all the data in the training set, without introducing any potential noise with an imputation method.

When i’ve plugged in this new Title feature into the random forest, i saw an improvement from 0.785 to 0.801 on my 10-fold cross validation out-of-sample accuracy estimation, and it was reflected in my submission on the public leaderboard where i jumped to the top 5% best submissions at that time.

Note that an improvement on your cross validation is not always reflected on the leaderboard, sometimes even the opposite (c.f “Lesson One” from this very cool blog post by @rouli, highly recommended). Note also that this particular competition lasts 1 year and was just for learning purpose, so there are thousands and thousands of participants, including not few people who obviously spent useless time to extract the answers from publicly available lists (e.g. here or here) to get a near perfect score (though you could use them to know you near real final score on the private leaderboard if you can’t wait the end of the competition, but still kind of pointless). Finally, more things can be done to try improve the accuracy even more, an obvious one being to combine multiple models together (majority vote is often used in binary/multi-class settings) but we won’t cover that in this post.

Conclusion

We explored on a comprehensive example how R can be used to build and tune quickly robust predictive models which are significantly outperforming the baseline. Of course, it is somehow a toy example but it was interesting enough to explore some important aspects needed when building predictive models. For much bigger data sets (both in terms of training set size and/or number of features in the data) you might need to introduce different/additional technical and theoretical tools that we might explore in future posts.

Also, note that a competition settings might be very different than a real production settings. Not only talking about why Netflix never implemented the model that won the $1M challenge,  but also the whole infrastructure that you’d need to build in order to do big data science at scale on many different problems (Scala is quickly becoming a trend around that, check those killer slides and talk by my friend @BigDataSc from LinkedIn and @ccservers from eBay for more on that ).

I’ll conclude by citing again this awesome sentence from this must-watch talk by @nmkridler : “Don’t get stuck in algorithm land! Focus on putting better data in the algorithm”. I really think that this is what data science is all about.

References / Useful Links

 

How To Easily Build And Observe TF-IDF Weight Vectors With Lucene And Mahout

tfidfYou have a collection of text documents, and you want to build their TF-IDF weight vectors, probably before doing some clustering on the collection or other related tasks.

You would like to be able for instance to see what are the tokens with the biggest TF-IDF weights in any given document of the collection.

Lucene and  Mahout can help you to do that almost in a snap.

Step 1 : Build a Lucene Index out of your document collection

If you don’t know how to build a Lucene index, check the links at the end of the post.

The two only important things in that step are to have in your index a field that can serve as a document id and to enable term vectors on the text field representing the content of your documents.

So your indexing code should contains at least two lines similar to:

doc.add(new Field("documentId", documentId, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED,TermVector.YES));

Step 2 : Use Mahout lucene.vector driver to generate weighted vectors from your lucene index

That step is well described here. It also explains how to generate the vectors from a directory of text documents. I used lucene because my documents were in a data store and building the lucene index out of it was just much more flexible and convenient.

You then should end up executing a command similar to:

 ./mahout lucene.vector --dir "myLucenIndexDirectory" --output "outputVectorPathAndFilename" --dictOut "outputDictionnaryPathAndFilename" -f content -i documentId -w TFIDF

Mahout will generate for you:

  • a dictionary of all tokens found in the document collection (tokenized with the Tokenizer you used in step 1 and that you might tune depending on your needs)
  • A binary SequenceFile (a class coming from hadoop) that will contains all the TF-IDF weighted vectors.

Step 3: Play with the generated vector file

Now, let’s say that you want for a given document id, to see what are the tokens that received the biggest weights in order to feel what are the most significant tokens of that document (as the weighting scheme sees it).

To do so, you can for instance easily load the content of the generated dictionary file into a Map with token index as keys and the tokens as values. Let’s call that map dictionaryMap.

Then you’ll have to walk through the generated binary file containing the vectors. By playing a little bit  with the sequence file and the Mahout source code, you get pretty quickly what are the important objects you have to manipulate in order to access vectors content in a structured way:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
String vectorsPath = args[1];
Path path = new Path(vectorsPath);
 
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
LongWritable key = new LongWritable();
VectorWritable value = new VectorWritable();
while (reader.next(key, value)) {
	NamedVector namedVector = (NamedVector)value.get();
	RandomAccessSparseVector vect = (RandomAccessSparseVector)namedVector.getDelegate();
 
	for( Element  e : vect ){
		System.out.println("Token: "+dictionaryMap.get(e.index())+", TF-IDF weight: "+e.get()) ;
	}
}
reader.close();

The important things to get in that code are the following:

  • namedVector.getName() will contains the documentId
  • e.index() will ontains the index of the token as present in the dictionary output file, so you can get the token itself using
    dictionaryMap.get(e.index())
  • e.get() contains the weight itself

From there you’ll be able easily to plug your code to do whatever you want with the tokens and their weights, like printing the token having the biggest weights in a given document.

It can be insightful to tune your weighting model. E.g. you can quickly observe that typing errors are often getting a super high weight, which makes sense in the TF-IDF weighting scheme (unless the typing error is very frequent in your document collection), and thus you might want to fix that.

It is also useful just to understand a little bit more of how mahout represents the data internally.

Useful links:

A Generic Method For Sorting (Google Collections) Multiset Per Entry Count

I’m regularly using the excellent google collections library (now final and part of the more general guava libraries). One of the data structure I’m using the most is probably the multiset (a.k.a bag). But most of the time, when I need a multiset to track the number of occurrences of particular entries, I almost always also need to know what is the most occurring entry (or the top N occurring entries).

Let’s take a canonical example: as you are parsing a text, you’re inserting each tokens into a multiset to track their number of occurrences and you simply want to know what are the top N most occurring tokens (ok, if you want to do it on terabytes of data, you might want to start learning hadoop :) ).

I need those kind of statistics  so frequently that I was surprised to not find an existing utility method allowing to sort the entries of a multiset per entry count (or number of occurrences). Here is my attempt to do it in a generic short and efficient way:

public <T> List<Entry<T>> sortMultisetPerEntryCount(Multiset<T> multiset){
	Comparator<Multiset.Entry<T>> occurence_comparator = new Comparator<Multiset.Entry<T>>() {
		public int compare(Multiset.Entry<T> e1, Multiset.Entry<T> e2) {
			return e2.getCount() - e1.getCount() ;
		}
	};
	List<Entry<T>> sortedByCount = new ArrayList<Entry<T>>(multiset.entrySet());
	Collections.sort(sortedByCount,occurence_comparator);
 
	return sortedByCount;
}

If you got any other better or most efficient way to do it (or if you know an existing utility method that does it), please share.

If you never used google collections, in addition to the official website, you might find those tutorials (1 and 2)  useful for an introduction.

What Are The 10 Most Cited Websites On Twitter When Tweeting About Hot Trends?

Lately I wrote a post on how to build a relevant real time search engine prototype in few hundreds lines of code.  Using a tailored ranking algorithm based on link popularity in twitter,  I showed that the prototype was able to return very relevant answers in response to very hot queries like the ones that can be found in the hourly updated list of google hot trends.

I wrote a small program on top of this prototype to run an experiment: each hour, the program crawl the new list of hot queries from google hot trends, then it runs the prototype on each of those queries and keep the hottest link found in twitter for the corresponding hot query. I wanted to see which websites were mostly cited in those tweets talking about hot trends.

So I let ran the program for a week, collected the  links (more than a thousand), expanded all those into their long URLs version (using an improved version of my java universal URL expander),  extracted the domain names and compiled the whole into a top 10 list of the most cited websites. Here it is (click to enlarge):

top10twitterBuzzWebsites

The Most Cited Websites When Tweeting About Hot Trends. Click to enlarge.

I was surprised to see some websites that I’ve never heard about before (like wpparty.com or actionnewsblast.com).

To have a better idea for which kind of hot queries/topics those websites are most cited in twitter, find below, for each of those top website, a sample of 5 google hot trends query they covered last week.

Website Sample of 5 covered google hot trends of this past week
www.cnn.com 2011 budget
ipad tablet
cnn.com/haiti360
concorde crash
federer murray
sports.espn.go.com federer tsonga australian open
aaron miles
tom brookshier
jackson jeffcoat
paul pierce
wpparty.com jackson jeffcoat
leon russell
wanamaker mile
buffalo exchange
recalled toyotas
www.huffingtonpost.com governor of virginia
obama republican retreat
obama gop
apple tablet announcement
groundhog prediction
twitpic.com miss america 2010 winner
what celeb do i look like
footprints in the sand
apple itablet
itablet
www.youtube.com i pad
grammy awards 2010
bob kellar
lakers celtics
ipad a disappointment
www.facebook.com general beauregard lee
roberta flack
action express racing
slightly stoopid
rolex 24 hours daytona
www.actionnewsblast.com codswallop meaning
jonathan antin
fred baron
codswallop definition
stevie nicks
www.netnewsticker.com arc energy
ego ferguson
kim burrell
reserveamerica
ivan mccartney
mashable.com national lady gaga day
ipad tablet
ipad thoughts
doppelganger week facebook
tebow super bowl ad

Few remarks:

  • All the links spotted by my prototype and that appear in the table are coming from real tweets around those google hot trends queries.
  • You’ll notice that apple iPad announcement is a theme that was covered by 4 of those top 10 websites!
  • I recommend you to have a look on the youtube video in the table around the google hot trend “ipad a disappointment” :) .
  • I also recommend you to have a look at the haiti 360 view covered by cnn.
  • For twitpic, it is only pics, so what you’ll find there is a sample of “trendy pics” (see below for more on that…)
  • Sometimes the hot query seems to be not connected with the related article at first view (like with fred baron). But when you take a closer look, there is always a connection! This is not for nothing that people tweet about a link with the text of the hot query in the tweet…

To finish, find below a picasa collage that I built using the most cited twitpic pictures in twitter for this past week of hot trends (not only the 5 cited in the table). You’ll identify easily some sarcastic pictures before the iPad announcement or pics around the election of Miss USA. Click the picture to enlarge.

picasaCollageTopPics

Collage of the most cited twitpic links in twitter for a week of google hot trends (Click to enlarge)

If you’re curious to map some pictures with its related hot topic, click the collage to enlarge it and try to guess which pics correspond to which google hot query below :) .

miss america 2010 winner, what celeb do i look like, miss america 2010, roberta flack, lady gaga and elton john, addicted to love, jim florentine, apple itablet, lost season 6 premiere, candy crowley, to make you feel my love, swagger crew, footprints in the sand, gasparilla, miss virginia, duke georgetown, celebrity look alike, katherine putnam, itablet, andrea bocelli, monster diesel, peta ad.

Hadoop Tutorial Series, Issue #4: To Use Or Not To Use A Combiner

combinerWelcome to the fourth issue of the Hadoop Tutorial Series. Combiners are another important Hadoop’s feature that every hadoop developer should be aware of. The primary goal of combiners is to optimize/minimize the number of key value pairs that will be shuffled accross the network between mappers and reducers and thus to save as most bandwidth as possible.

Indeed, to give you the intuition of why combiner helps reducing the number of data sent to the reducers, imagine the word count example on a text containing one million times the word “the”. Without combiner the mapper will send one million key/value pairs of the form <the,1>. With combiners, it will potentially send much less key/value pairs of the form <the,N> with N a number potentially much bigger than 1. That’s just the intuition (see the references at the end of the post for more details).

Simply speaking a combiner can be considered as a “mini reducer” that will be applied potentially several times still during the map phase before to send the new (hopefully reduced) set of key/value pairs to the reducer(s). This is why a combiner must implement the Reducer interface (or extend the Reducer class as of hadoop 0.20).

In general you can even use the same reducer method as both your reducer and your combiner. This is the case for the word count example where using a combiner remains to add a single line of code in your main method:

conf.setCombinerClass(Reduce.class);

where conf is your JobConf, or, if you use hadoop 0.20.1:

job.setCombinerClass(Reduce.class);

where job is your Job built with a customized Configuration.

That sounds pretty simple and useful and at first look you would be ready to use combiners all the time by adding this simple line, but there is a small catch. The first kind of reducers that comes naturally as a counter example of using combiner is the “mean reducer” that computes the mean of all the values associated with an given key.

Indeed, suppose 5 key/value pairs emitted from the mapper for a given key k: <k,40>, <k,30>, <k,20>, <k,2>, <k,8>. Without combiner, when the reducer will receive the list <k,{40,30,20,2,8}>, the mean output will be 20, but if a combiner were applied before on the two sets (<k,40>, <k,30>, <k,20>) and (<k,2>, <k,8>) separately, then the reducer would have received the list <k,{30,5}> and the output would have been different (17.5) which is an unexpected behavior.

More generally, combiners can be used when the function you want to apply is both commutative and associative (that’s pretty intuitive to understand why). That’s the case for the addition function, this is why the word count example can benefit from combiners but not for the mean function (which is not associative as shown in the counter example above).

Note that for the mean function you can use a workaround for using combiners by using two separate reduce methods, a first one that would be used as the addition function (and thus that can be set as the combiner) that would emit the intermediate sum as the key and the number of addition involved as the value, and a second reduce function that would compute the mean by taking into account the number of addition involved (see the references for more details on that).

As usual in this series, let’s observe the lesson learned in action using our learning playground. For that you can use the original word count example (or its hadoop 0.20.1 version that we used in the previous issue), add it the single combine line as specified earlier in the post and run it on our moby-dick mascot. Here what we can see at the end of the execution:

combiner

Output of the word count example when using a combiner. Click to enlarge.

Now that you understand what counters are, if you click to enlarge the picture, you’ll see the value of two counters: Combine input records=215137 and Combine output records=33783. That’s a pretty serious reduction of the number of key/value pairs to send to the reducers. You can easily imagine the impact for much larger jobs (see the reference below for real numbers).

Enjoy combiners, whenever you can…

References

  • See the 4th tip of this must read blog post by Todd Lipcon for feeling better the benefit of combiners on a 40GB wordcount job benchmark.
  • For a deeper understanding of when and how combiners are used in the mapReduce data flow, check this section of the (quiet heavy but) excellent Yahoo! hadoop tutorial.
  • To extend the intuition given in the post on why combiners help, you can go over this walk-through.
  • Both Hadoop the definitive guide and Hadoop in Action contains interesting information on combiners (part of both of them inspired this post). In particular the first contains a great section on when exactly the combiners comes into play in the mapReduce data flow. The second contains a full code of the mean function workaround mentioned above.

Hadoop Tutorial Series, Issue #3: Counters In Action

Note: This post has been updated with a code working for hadoop 0.20.1.

In this 3rd issue of the hadoop tutorial series, we’ll speak about a very simple but very useful hadoop’s feature: counters.

Even if you have never defined any counters in hadoop, you can see some of them each time you are running an hadoop job. Indeed, here is what you can see from the client console at the end of the execution of a job (can also be seen from the web interface):

counters

Hadoop internal counters at the end of a job (Click to enlarge).

As you can see, 18 internal counters are presented inside different groups. For instance, you can see a section “Job Counters” with three different counters giving basic information about the job like the number of mappers and reducers. In that example, “Job Counters” is called the group of the counter and “Launched reduce tasks” (for instance) is properly the name of the counter.

It is very handy to define your own counters to track any kind of statistics about the records you are manipulating in the mapper and the reducer. The most natural use of that is to use counters to track the number of malformed records.

If you are executing a job  and you see an abnormally high number of malformed records, it can give a good hint that you perhaps have a bug in your code or some problem with your data (note this is actually a much simpler way to spot issues than tracking error messages in a distributed set of log files). But you can actually use counters for any kind of other statistics on your records.

One easy way to define your own counters from your Java code is:

  • Declaring an enum representing your counters. The enum name is the group of the counter, and each field of the enum is the name of the counter that will be reported in this same group
  • Incrementing the desired counters from your map and reduce methods through the Context of your mapper or reducer (in previous hadoop version it was through the Reporter.incrCounter() method, but the reporter no longer exists in hadoop 0.20)

So let’s see an example. We’ll take the word count example revised for version 0.20.1 to illustrate the use of counters. We will create a counter group called WordsNature that will count how many unique tokens there is in all, how many unique tokens starts with a digit and how many unique tokens starts with a letter.

So our enum declaration will look like that:

 static enum WordsNature { STARTS_WITH_DIGIT, STARTS_WITH_LETTER, ALL }

We will also need a very basic StringUtils class:

package com.philippeadjiman.hadooptraining;
 
public class StringUtils {
 
	public static boolean startsWithDigit(String s){
		if( s == null || s.length() == 0 )
			return false;
 
		return Character.isDigit(s.charAt(0));
	}
 
	public static boolean startsWithLetter(String s){
		if( s == null || s.length() == 0 )
			return false;
 
		return Character.isLetter(s.charAt(0));
	}
 
}

Since we are interested in unique tokens, we will put the code related with the counter into the reduce method. So here how the reduce method will look like:

public void reduce(Text key, Iterable values, Context context)
	throws IOException, InterruptedException {
 
	int sum = 0;
	String token = key.toString();
	if( StringUtils.startsWithDigit(token) ){
		context.getCounter(WordsNature.STARTS_WITH_DIGIT).increment(1);
	}
	else if( StringUtils.startsWithLetter(token) ){
		context.getCounter(WordsNature.STARTS_WITH_LETTER).increment(1);
	}
	context.getCounter(WordsNature.ALL).increment(1);
	for (IntWritable value : values) {
		sum += value.get();
	}
	context.write(key, new IntWritable(sum));
}

Here is the code of the WordCountWithCounter that include this code.

If you want to run it inside our learning playground you’ll just have to update the pom with hadoop latest version:

<dependency>
<groupId>org.apache.mahout.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>0.20.1</version>
</dependency>

So here is the result after running the code with, as input, the whole text of moby dick:

jobResultsWithCounters

We can now see our home made counters. (Click to enlarge)

So we can see now that we have 33783 unique tokens, 32511 starting with a letter and 263 starting with a digit. What about the 1009 others?? Well, because the word count example use a basic StringTokenizer that splits tokens at spaces, a lot of words simply starts with a ‘(‘ or with a ‘[‘ and even with ‘–’. To solve that you can for instance use a lucene StandardAnalyzer.

You should now be able to easily implements your own counters for tracking bad records/missing values, debugging or gathering any kind of statistics around your job.

See you soon for another issue…

How To Build A Relevant Real Time Search Engine Prototype In Few Hundreds Lines Of Code

gootterBy the end of the post you’ll find the code along with a small command line JAVA program to play with, but let me first describe the specifications of the real time search engine prototype that I’m targeting here.

Basically it should take as input a  search query and return as output a ranked set of URLs that would correspond to the latest hot news around that search query.

In some way it is similar to what you would expect to find on google news or in one of the dozens real time search engine that were released last year (let’s cite oneriot, crowdeye and collecta).

The goal of my prototype is to demonstrate how to leverage twitter and a simple ranking algorithm to obtain most of the time relevant URLs in response of hot queries, without having to crawl a single web page! As my primary target is relevancy, I won’t invest any effort on performance or scalability of the prototype (retrieved results will be build at query time).

High level description of the prototype

Basically what I did is to use the twitter API through a java library called twitter4j to retrieve all the latest tweets containing the input query and that contains a link. For very hot queries, you’re likely to get a lot of those (I put a limit of the last 150 but you’ll be able to change it). Once I got my “link farm”, what I do is to build a basic ranking algorithm that would rank first the URLs that are the most referenced.

As most of the URLs in tweets are shortened URLs, the trick is to spot the same URLs that were shortened by different shortening services. For instance both of the following shortened URLs points to a same page of my blog: http://tinyurl.com/yajkgeg and http://bit.ly/SmHw6. It can sounds as a corner case but it actually happens all the time on hot queries. So the idea is to convert all the short URLs in their expanded version. To see how to write an universal URL expander in JAVA that would work for the 90 + existing URL shortening services check the post that is referenced by the two short URLs above.

Note that you can improve the ranking algorithm in tons of way, by exploiting the text in the tweets or who actually wrote the tweet (reputation) or using other sources like digg and much more, but as we’ll see, even in its simplest form, the ranking algorithm presented above works pretty well.

Playing with some hot queries

To find some hot queries to play with, you can for instance take one of the google hot trends queries (unfortunately down from 100 to 40 to 20). Let’s try with a very hot topic while I’m writing this post: the google Nexus One phone that was about to be presented to the press two days after I started to wrote this post.

Below I have compiled the results obtained respectively by Google News, OneRiot and my toy prototype on the query “nexus one”. Click the picture to enlarge.

OneRiotGoogleNewsProto_NexusOne.jpg

Comparing the results on Nexus One. Click to enlarge.

I hope you enjoyed my killer UI :) . But let’s focus on the three URLs corresponding of the first result of each one:

Given the fact that at the time I issued the query, the Nexus one was not yet released, I would say that the article that the prototype found is the best one since it is the only one that present an exclusive video demonstrating the not yet released phone. This is also why so much people were twitting about this link: because it was the best at that precise time! We’ll see even more in the next section.

Before, let’s try with another hot query today (in the top 20 hottest queries of google hot trends): “byron de la beckwith”.

That time, it is not clear what is the story/news hidden behind that hot query but running it on the prototype gives as the first link the article below (click on the picture if you want to see the full article).

byron de la beckwith

First ranked result by the prototype for the query "byron de la beckwith". Click to follow the article.

Again this is a very relevant result (oneRiot and Google News gave the same one at that time).

The temporal aspect of hot queries

What is interesting with hot queries is that you expect the result to change even within a short amount of time. Indeed, any story or breaking news generally evolve as new elements comes in. As promised let’s follow our “nexus one” query.

In the previous section, the prototype’s first result was a very relevant article from engadget. I relaunched the same query, but after 12 hours. The first ranked result returned by my prototype gives me now a different result: still another article from engadget (see picture below), but that time with a much more in depth review of the phone with more videos including a very funny comparison between the android, iphone and nexus one.

Then I waited for Google doing its press conference one day later. I issued the query again. Can you guess what was the first link given by my prototype? You got it, the official Google Nexus One website.

nexusOneAfter12hours

The first link given by the prototype on "nexus one" about one day before its official presentation by Google. Click to follow the article

Again this is not a corner case. This temporal aspect happens all the time, for any type of breaking news or events. As a last example of that phenomenon, let’s take the movie avatar. The first days before and after that the movie were released, all you got is links to see the trailer or even the movie. Now, few weeks after, what you get is a very fast changing list of links around fun pictures of parodies of the movie with title like “Do you want to date my avatar” (picture below) or a letter attempting to prove that avatar is  actually Pocahontas in 3d :) .

wantYouDataMyAvatar

Few weeks after the release of the Avatar movie, first links are a fast changing list of parodies

Playing by yourself with the prototype

If you just want to run the prototype through the command line

You must  have java 6 installed (you can check by opening a console and type java -version). On recent mac, see those instructions for having java 6 ready to use in a snap.
Then just download this zip archive: jarsDependencies.zip.
Save it and extract it somewhere in your computer. It will create a directory named prototypeJars.
Open a command prompt. Go inside the directory prototypeJars.

If you are on windows, just type:

java -cp "*;" com.philippeadjiman.rtseproto.RealTimeSEPrototype "nexus one" 150 OFF

If you are on Linux or Mac just type:

java -cp "*" com.philippeadjiman.rtseproto.RealTimeSEPrototype "nexus one" 150 OFF

You’ll notice the three last arguments (all are mandatory):

  • “nexus one”: is the query. Type whatever you want here but keep the quotes.
  • 150: is the maximum number of tweets to retrieve from the timeline. Put whatever number between 1 and 1000 but 150 is good enough.
  • OFF: whether or not you want the prototype to expand the short URLs. If you put ON, you should be patient, it may take a while. Even if duplicate short URLs happen all the time, going with OFF gives a good approximation of which are the leading results. Unless a problem with Twitter, putting OFF should provide you the results within few seconds.

Only the top 20 first results will be printed.

If you want to play with the code

As the title suggests, that just few hundreds lines of (JAVA) code. As it is a toy project and to keep things simple I voluntarily didn’t use any DI framework like spring or guice and tried to use as less external libraries as possible unless necessary (even no log4j!). I did wrote a minimal amount of unit tests since I cannot code without it and I did use the google-collections library for the same reason :) .

Also I tried to wrote at least a minimal amount of comments, in particular where I think the code should be improved a lot for better performance but remember: the prototype is of course not scalable as it does not rely on any indexing strategy (it computes the results at query time). Building a real a real search engine would at first involve building an index offline (using lucene for instance).

You’ll find the source code here prototype_src.zip.

If you are using maven and eclipse (or other popular IDE), you should be ready to go in less than a minute by unpacking the zip, typing “mvn eclipse:eclipse” and importing the existing project.

Some final remarks

What I wanted to prove here is mainly that without crawling a single webpage, you can answer to “hot queries” with a relevancy comparable to what you can find on google news or any “real time search engine”. This is made possible by judiciously using the tremendous power that twitter provide with its open API.

Of course building a real “real time search engine” would require much more than few hundred lines of code :) and hundreds of features could be added to that prototype, but I would keep two core principles:

  • real time search results should be links and not micro blogging text like tweets. The text of some tweets can be relevant but as a secondary level of information.
  • let the “real time crowd” do the ranking for you. If a link is related in some way with your query and was highly and recently tweeted or digged (you name it), then there is a good chance that it will be a relevant “real time” result.

In that sense, among the dozens of real time search engines I have tested, my favorite one remains oneriot.

This is for the “pull” side of the things (when the user knows what to search for). I did not talk about the “push” side of the real time web here, probably in another post…

If you have issues running the prototype or any other question/remark, please shoot a comment.

Hadoop Tutorial Series, Issue #2: Getting Started With (Customized) Partitioning

In the Issue #1 of this series, we set up the “learning playground” (based on the Cloudera Virtual Machine) in order to enjoy hands-on learning experiences around Hadoop. In this issue, we’ll use our playground to investigate the partitioning features offered by Hadoop.

What is it all about?

As you may know, a map/reduce job will contains most of the time more than  1 reducer.  So basically, when a mapper emits a key value pair, it has to be sent to one of the reducers. Which one? The mechanism sending specific key-value pairs to specific reducers is called partitioning (the key-value pairs space is partitioned among the reducers). A Partitioner is responsible to perform the partitioning.

In Hadoop, the default partitioner is HashPartitioner, which hashes a record’s key to determine which partition (and thus which reducer) the record belongs in.The number of partition is then equal to the number of reduce tasks for the job.

Why is it important?

First, it has a direct impact on the overall performance of your job: a poorly designed partitioning function will not evenly distributes the charge over the reducers, potentially loosing all the interest of the map/reduce distributed infrastructure.

Second, it maybe sometimes necessary to control the key/value pairs partitioning over the reducers. Let’s illustrate it on a simple example. Suppose that your job’s input is a (huge) set of tokens and their number of occurrences (for instance the output of the canonical word count hadoop example) and that you want to sort them by number of occurrences. Let’s also suppose that your job will be handled by 2 reducers. If you run your job without using any customized partitioner, you’ll get something like this:

partialSortOn2Reducers

(Click to enlarge)

As you can see, the tokens are correctly ordered by number of occurrences on each reducer (which is what hadoop guarantees by default) but this is not what you need! You’d rather expect something like:

TotalSortOn2Reducers

(Click to enlarge)

where tokens are totally ordered over the reducers, from 1 to 30 occurrences on the first reducer and from 31 to 14620 on the second. This would happen as a result of a correct partitioning function: all the tokens having a number of occurrences inferior to N (here 30) are sent  to reducer 1 and the others are sent to reducer 2, resulting in two partitions. Since the tokens are sorted on each partition, you get the expected total order on the number of occurrences.

Below, we’ll use our playground to observe the issue happening  on real data and see how we solve it using customized partitioners.

Also, as a second example of use of customized partitioning functions, let’s cite the original map/reduce google paper: “sometimes the output keys are URLs, and we want all entries for a single host to end up in the same output. To support situations like this, the user of the MapReduce library can provide a special partitioning function. For example, using “hash(Hostname(urlkey)) mod R” as the partitioning function causes all URLs from the same host to end up in the same output”.

Feeling the partitions in our playground

If your playground is not yet set up, check the Issue #1 of this series. As an input for our job, we’ll use a tsv file containing the list of tokens and their number of occurrences extracted from (once again) the full moby dick text. Click here to download this input. You’ll notice that the pairs (tokens, #occurrences) are alphanumerically sorted on tokens value.

First, we’ll use a very simple pre-processing job to transform the input data into a more convenient format to use within hadoop: the Sequence File Output Format. Sequence files are a basic file based data structure persisting the key/value pairs in a binary format and allowing you to interact more easily with basic hadoop data types (e.g IntWritable, LongWritable, etc…). Here is the simple pre-processing job:

package com.philippeadjiman.hadooptraining;
 
import java.io.IOException;
 
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
 
public class SortDataPreprocessor {
 
	static class PreprocessorMapper extends MapReduceBase implements Mapper {
 
		private Text word = new Text();
 
		public void map(LongWritable key, Text value,
				OutputCollector output, Reporter reporter) throws IOException {
			String line = value.toString();
			String[] tokens = line.split("t");
			if( tokens == null || tokens.length != 2 ){
				System.err.print("Problem with input line: "+line+"n");
				return;
			}
			int nbOccurences = Integer.parseInt(tokens[1]);
			word.set(tokens[0]);
			output.collect(new IntWritable(nbOccurences),word );
		}
	}
 
	public static void main(String[] args) throws IOException {
		JobConf conf = new JobConf(SortDataPreprocessor.class);
 
		FileInputFormat.setInputPaths(conf, new Path(args[0]));
		FileOutputFormat.setOutputPath(conf, new Path(args[1]));
 
		conf.setMapperClass(PreprocessorMapper.class);
		conf.setOutputKeyClass(IntWritable.class);
		conf.setOutputValueClass(Text.class);
		conf.setNumReduceTasks(0);
		conf.setOutputFormat(SequenceFileOutputFormat.class);
		JobClient.runJob(conf);
	}
}

You’ll notice that:

  • it contains only a mapper (no reducer),
  • a basic error management is performed for potential malformed lines,
  • the output key is the number of occurrences (as an IntWritable) and the output value is the associated token,
  • The sequence file output format is specified using setOutputFormat(SequenceFileOutputFormat.class);

To run it, package the job using maven (see Issue #1), put the input file on hdfs in an input directory (let’s call it input) and execute:

hadoop jar playing-with-partitions.jar com.philippeadjiman.hadooptraining.SortDataPreprocessor /user/training/input /user/training/pre_process

This will create a directory called “pre_process” on hdfs containing a set of pairs (#occurrences,token), respectively of format IntWritable and Text, in a SequenceFileOutputFormat.

Now we can, perform the sort based on this new input. Writing a job for such a task is actually trivial since this is primarily what hadoop is doing by default, so here it is:

package com.philippeadjiman.hadooptraining;
 
import java.io.IOException;
 
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
 
public class SortExample {
	public static void main(String[] args) throws IOException {
		JobConf conf = new JobConf(SortExample.class);
		conf.setJobName("sortexample");
 
		FileInputFormat.setInputPaths(conf, new Path(args[0]));
		FileOutputFormat.setOutputPath(conf, new Path(args[1]));
 
		conf.setInputFormat(SequenceFileInputFormat.class);
		conf.setOutputKeyClass(IntWritable.class);
		conf.setOutputValueClass(Text.class);
 
		conf.setNumReduceTasks(2);
 
		JobClient.runJob(conf);
	}
}

You’ll notice that:

  • There is neither map nor reduce methods! This is because sorting is a default behavior so we don’t have to do anything (we’re just interested here to see how it’ll be partitioned),
  • The input/output formats are specified based on the output of our pre-processing job,
  • We explicitly set the number of reducer to 2, which is the important part here since we want to observe how the output will be partitioned (without specifying it, the output will be generated using only one reducer).

Just run it using:

hadoop jar playing-with-partitions.jar com.philippeadjiman.hadooptraining.SortExample /user/training/pre_process /user/training/output

Once completed, an output directory will be created on hdfs with two files, one for each reducer that were used. You can observe the content of the output using commands like:

hadoop fs -cat output/part-00000 | less
hadoop fs -cat output/part-00001 | less

As you’ll see, the two outputs are sorted but do not represent a total order, as explained above. Let’s fix it.

How to implement your own partitioning function

So how do we create a total order out of those two reducers?

A first solution is to create our own partitionner which is as simple as implementing the Partitioner<K,V> interface:

package com.philippeadjiman.hadooptraining;
 
package com.philippeadjiman.hadooptraining;
 
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
 
public class MyPartitioner implements Partitioner<IntWritable,Text> {
	@Override
	public int getPartition(IntWritable key, Text value, int numPartitions) {
		/* Pretty ugly hard coded partitioning function. Don't do that in practice, it is just for the sake of understanding. */
		int nbOccurences = key.get();
 
		if( nbOccurences < 3 )
			return 0;
		else
			return 1;
	}
 
	@Override
	public void configure(JobConf arg0) {
 
	}
 
}

This implementation of getPartition specifies to put all the pairs having a key (which is here the number of occurrences) being less than 3 into the first partition and the other into the second one. This is of course a pretty bad practice to hard code like that a partitioning function but this is simply for the sake of understanding.

To use this created partition just add the following line to the main method of the previous SortExample class:

conf.setPartitionerClass(MyPartitioner.class);

Why did I choose 3? Because as a side effect of the Zipf law, the number of tokens having a number of occurrences of 1 and 2 will be as much as big that all the others together! (to see a Zipf Law in action, check this post). So 3 was chosen just for balancing a little bit the partitions.

Re-package the code with the customized partition, remove the old output, run it again and check that our problem is solved: there is a total order over the partitions.

How to automatically find “good” partitioning function using sampling

Now, as I mentioned above, it is a pretty bad practice to hard code how to partition the keys. But on the other hand, how would I know automatically and in advance how to divide the partition in the general case?

Hadoop provide a nice way to approximate a priori a good partitioning function using an InputSampler. For instance, a RandomSampler, will sample the input at random to estimate what is the best way to partition. The sampler will wrote into a file called, by default, _patition.lst describing the partition that the job will automatically use to decide which key/value pairs to send to which reducers. This mechanism has to be used in combination with a TotalOrderPartitioner.

Here is a code sample using such a sampler with a total order partitioner:

InputSampler.Sampler<IntWritable, Text> sampler =
	new InputSampler.RandomSampler<IntWritable, Text>(0.1, 100);
InputSampler.writePartitionFile(conf, sampler);
conf.setPartitionerClass(TotalOrderPartitioner.class);

Sometimes, there are some issues with the file _partition.lst that is not found. It always worked for me when I specified explicitly where to find the file using the TotalOrderPartitioner.setPartitionFile(); method. Also pay attention to invoke this method before the call to writePartitionFile. Also, note that the sampling mechanism is necessary since considering all the input to compute the partition would be inefficient for large files.

Some remarks

  • A customized partitioning would not have been necessary if we had only one reducer since all the key/value pairs would have ended into the same output file. It is easy to understand that such a constraint is a nonsense and that using more than one reducer is most of the time necessary, else the map/reduce concept would not be very useful…
  • Even if we used a small dataset on the semi-distributed infrastructure of the cloudera virtual machine to observe partitioning in action and to learn how to customize it, the same concepts can be applied to a larger infrastructure. To see a very interesting use case of customized partitioning strategy for sorting purpose on a big infrastructure, check the famous TeraByte sort on hadoop.

Conclusion

Partitioning in map/reduce is a fairly simple concept but that is important to get correctly. Most of the time, the default partitioning based on an hash function can be sufficient. But as we illustrated in this Issue, you’ll need sometime to modify the default behavior and to customize your own partitioning suited for your needs.

If you have some questions, or if you have experimented other use cases of customized partitioning in your application, please comment/share. See you soon for another Issue.

Hadoop Tutorial Series, Issue #1: Setting Up Your MapReduce Learning Playground

hadoop-logo

Update: Instructions updated for hadoop 0.20.2.
This is the first post of a series of small hadoop tutorials introducing progressively core hadoop functionnalities. You might be interested in that series if you recognized yourself in one or more of the following points :

  • You’ve heard about the basics of MapReduce (else check the links that I recommend at the end of this post)
  • Even if you’re not working at Google, Yahoo, Facebook (or many others) for now, you know it’s been years that MapReduce/Hadoop has become a must-have skill and that you should practice it but you have very few time
  • You did try to read some tutorials but it was always either not hands-on enough or too much detailed

This first post is dedicated to build what I called a “MapReduce Learning Playground”: for practice or for a real need, you read or wrote  on a sheet of paper the map and reduce functions that might solve a particular problem and you want to see it in action, not necessarily on huge data sets, just check that it computes the correct answer.

A lot of material can be found on the internet to do the same. The steps below are my attempt to present the best part of all the training material that I read on the subject, adapt it, adding it some glue (here with maven) and compile the whole into something that I hope will save you some time.

Step 1: Install the cloudera training virtual machine

Cloudera is really doing a great job at providing training material for hadoop. The most useful one in my opinion is their hadoop training virtual machine. Update: they changed a lot of things since that post was written, here is a better link for a training virtual machine (Thanks Karthick for letting me know that the old link was broken ). It provides a VMWare image of a Linux Ubuntu distribution with a pre-installed hadoop cluster in Pseudo-Distributed Mode.

To install the VM on your computer, just follow their instructions, it is free (except if you’re on Mac) and very easy. The VM comes also with hadoop related tools already installed like hive and pig (it will probably be the subject of other posts).

At the end of the installation, open the VMWare Player, start the cloudera VM (with training/training as user/pass) and you should get something like this:

cloudera-hadoop-vm-training

The cloudera hadoop virtual machine for training (click to enlarge)

Step 2: Creating an “hadoop ready” project with maven

Cloudera does provide some training projects already mounted in the eclipse installed in the VM but those projects contains several small errors (like missing dependencies). Even if those are very easily fixed, I describe here how to build your own project from scratch; it will give you a better basis in case you want to extend them and you’ll always be able to copy-paste the map-reduce functions of some interesting cloudera training projects into your own ones.

First install maven on the VM. Open a terminal from the VM and type:

sudo apt-get install maven2

If for some reasons you run into trouble with the installation of maven, you can always download it directly from here. Assuming you are unzipping it into the /user/local/apache-maven directory, you can add those lines into your /home/training/.bashrc configuration file:

export M2_HOME=/usr/local/apache-maven
export M2=$M2_HOME/bin
export PATH=$M2:$PATH

Then from the same terminal go into the workspace directory (usually located at ~/workspace) and create a java project hierarchy using the following maven command (change the groupId and the artifactId as you like):

mvn archetype:create -DarchetypeGroupId=org.apache.maven.archetypes  -DgroupId=com.philippeadjiman.hadooptraining -DartifactId=hadoop-first-example

Then enter into the hadoop-first-example directory and generate the necessary files for eclipse:

mvn eclipse:eclipse

Then open eclipse from the VM then File -> Import -> Existing Projects into Workspace -> Browse, choose the hadoop-first-example directory, OK -> Finish. Then you should see your project on the left side.

You may have an error on a M2_REPO unresolved variable, that’s OK, it’s because it is the first time that this eclipse use a maven project. To fix it, just right click on your project -> Build Path -> Configure Build Path -> Add Variable -> Configure Variables -> New. In the name type M2_REPO and in the path type /home/training/.m2/repository (just check that this directory exists).

Then you’ll have to add the hadoop jar dependency. To do so, you just have to open you pom.xml file (you’ll see it at the bottom of your project) and the following dependency (add it just before the </dependencies> closing tag ):

<dependency>

<groupId>org.apache.hadoop</groupId>

<artifactId>hadoop-core</artifactId>

<version>0.20.2</version>

</dependency>

You can check here if there is a newer version. Note also that if you plan to use hadoop to run with a specific framework built on top of it then make sure you’re using the right version. E.g. for mahout, use the version you’ll find here.

Then you can go back to the terminal in your hadoop-first-example directory and type again mvn eclipse:eclipse to regenerate the eclipse files with now the hadoop dependency. You can now refresh your directory. You have now an “hadoop-ready” project.

Step 3: Put your map reduce program into your project and prepare the data on HDFS

The first time you heard about MapReduce, there is a good chance that you also heard about the word count example. The wordCount code on hadoop website is quiet outdated for hadoop v0.20, here is a link to a blog post with a more updated version of the word count code that will work with the 0.20.2 hadoop version used in that tutorial (Thanks Yi!). Be sure to put that code into the src directory of your project into a package called com.philippeadjiman.hadooptraining (or whatever else, as long as it matches to your package declaration).

Then before to deploy the job, you’ll have to count some words. Following a moby dick tradition in this blog,  let’s download the full english raw text of moby dick that you can  find here.  If you want to run the hadoop job on it, you’ll have to put this file on HDFS, the underlying hadoop file system (check the “useful links” section below if you’ve never heard about HDFS).

Navigating, reading from and writing to HDFS is super simple and if you’re already familiar with regular unix file system commands  then you’ll get it instantly: almost all the commands are the same, you just have to pass them to the wrapper command hadoop with ‘fs’ as argument and a ‘-’ before the command. Examples (to run from a regular terminal):

hadoop fs -help # will print all the command that you can execute on the HDFS
hadoop fs -ls  # will perform an ls from the HDFS home directory (set to /user/training in the VM).
hadoop fs -mkdir input # will create the directory 'input' in the HDFS home directory (check if it does not already exists)
hadoop fs -mkdir output # will create the directory 'input' in the HDFS home directory (check if it does not already exists)
hadoop fs -put mobyDick.txt input # will put your local copy of mobyDick into the directory 'input' on the HDFS

Step 4: package your job, run it, observe the result

To launch your job on the hadoop infrastructure, you’ll have to package it into a jar file. With maven, nothing is more simple. Just go into your hadoop-first-example directory and type:

mvn jar:jar

This will generate a jar into a sub-directory called target. Your jar will have a name like hadoop-first-example-1.0-SNAPSHOT.jar (you can change that generated name by editing the jar section of your pom). You can check that your jar file contains as expected the WordCount.class (and its inner classes) by typing:

jar -tf hadoop-first-example-1.0-SNAPSHOT.jar

You can now launch your hadoop job by executing the following command (adapt it with the correct package name if necessary):

hadoop jar hadoop-first-example-1.0-SNAPSHOT.jar com.philippeadjiman.hadooptraining.WordCount /user/training/input /user/training/output

Cloudera also comes with an handy web interface allowing to, among other things, monitor the jobs running on the cluster. Just open firefox and go to the url http://localhost:8088/. From the menu at the top right side of the page, choose job browser and you should see the status of your job. You can also click on it to see the details (status and number of mapper/reducer of your job):

cloudera-web-interface

Using the cloudera web interface to see your job status (click to enlarge)

You can also see your job output from there but I found it easier to check it directly from HDFS:

hadoop fs -ls output # should now contains something (a file named part-00000 should be your output)
hadoop fs -cat output/part-00000 | less # will let you browse easily your output

Important: if you want to run your job another time, you’ll have to first erase all the current files of the output:

hadoop fs -rmr output # erase all the files contained in the output directory

You’ll notice that the output file contains many words with the ” character and other similar noise. This is of course because a simple StringTokenizer is used in the map function. To parse the text correctly, consider using some standard analyzers from Lucene for instance (you have an example in the code of step 2 of this post).

Step 5: Modify, Customize, Play, Learn. Now start the real fun…

Now that you built and deployed you own project from scratch, you have all you needs to modify certain parts of the code, create new map/reduce programs, test methods from the hadoop api, observe the results.

You can for instance try to see what happens if you use more than one reducer in the word count job (using conf.setNumReduceTasks(2) in the main method). Which lines are sent to which reducer? How to control that? How to sort the output of word count by number of occurrence (highest number first)?

Also I recommend to go over this tutorial that shows how to build an inverted index (see the theory here) using map/reduce (the tutorial contains many broken references w.r.t. the cloudera VM but you don’t care since now you know what you’re doing ;) )

Future posts of this series will leverage the playground we built here to illustrate and learn about other interesting stuff around hadoop.

Useful links:

hadoop-first-example-1.0-SNAPSHOT

Flexible Collaborative Filtering In JAVA With Mahout Taste

Mahout-logo-164x200 I recently had to build quickly a prototype of recommendation engine for a promising start-up company. I wanted to first test state of the art collaborative filtering algorithms before to build a customized solution (potentially on top of those algorithms). Most importantly, I wanted to be able to compare quickly all the different algorithm configurations with which I would come up with. Mahout Taste (previously a sourceforge project but recently promoted to the Apache Mahout project) was simply answering all those needs in one place.

I describe below how in few easy steps, I was set up to express my creativity without having to reinvent the wheel. This tutorial is based on the 0.2 release of Mahout.

Step 1: Set up your environment with mahout taste

I usually use Eclipse with Maven to simply add a dependency but the mahout pom had some repository issues by the time I tried, so I worked around it by adding the required libraries in eclipse manually (all the libraries found in the directory lucene/mahout/trunk/core/target/dependency of their latest release).

Step 2: Familiarize yourself by building a simple recommendation engine based on the movie lens data

To see a recommender engine in action, you can for instance download one of the movie Lens ratings data sets (I choose the one with one million ratings). Unzip the archive somewhere. The file that will interest you is ratings.dat. Its format is as follows:

userId::movieId::rating::timestamp

The basic mahout taste FileDataModel only accept the simple following format:

userId,movieId,rating

There are many ways to transform your original file in that format, I used the simple following perl command:

perl -F"::" -alne 'print "@F[0],@F[1],@F[2]"' ratings.dat > ratingsForMahout.dat

You think: “what about the timestamp information???”. Yes, you right, it is a pretty crucial information given that it is based on temporal dynamics that the winning team of the Netflix prize made the difference (BTW, if you’re interested in the subject, you must see this video of Yehuda Koren’s lecture at KDD).

So, don’t worry, you can customize later your own DataModel class that parse any information you want, you’ll just have to implement the DataModel interface (you can also extends the FileDataModel class).

To obtain your first recommendations in few lines of code, you can use

import java.io.File;
import java.io.FileNotFoundException;
import java.util.List;
 
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
 
public class MahoutPlaying {
	public static void main(String[] args) throws FileNotFoundException, TasteException {
		DataModel model;
		model = new FileDataModel(new File("/home/padjiman/data/movieLens/mahout/ratingsForMahout.dat"));
		CachingRecommender cachingRecommender = new CachingRecommender(new SlopeOneRecommender(model));
 
		List recommendations = cachingRecommender.recommend(1, 10);
		for (RecommendedItem recommendedItem : recommendations) {
			System.out.println(recommendedItem);
		}
 
	}
}

which creates in few lines of code a slope one recommendation engine and print the 10 first recommendations for user 1. You’ll see there only movieIds so you’ll have to check the file movies.dat to see the actual movie title (you can also write a simple method or script that shows you directly the movie title if you want to play with several users or to create your own user).

You can replace the slope one recommender with whatever other recommendation engine provided in the package. For instance, let’s say you want to use a classic user based recommender algorithm using the Pearson correlation similarity with a nearest 3 users neighborhood, simply replace the line that build the recommender in the above code by the code below:

UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(3, userSimilarity, model);
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
Recommender cachingRecommender = new CachingRecommender(recommender);

Few issues you might have during step 2:
- OutOfMemoryError: the slope recommender is pretty greedy and on the 1 Million Movie Lens Dataset, you may have to set the -Xmx VM option to 1024m (in eclipse, just add -Xmx1024m to the VM arguments in the run configuration options).
- Some errors during the FileDataModel initialization: pay attention that the directory containing your file to parse does not contains other files starting with the same name; for some reasons it disturbs the DataModel initialization in some cases.

Step 3: Test the relevance of the algorithms

In my opinion the most valuable part of the whole process. To feel immediately if your intuition of choosing a particular algorithm is a good one, or to see the good or bad impact of your own customized algorithm, you need a way to evaluate and compare them on the data.

You can easily do that with mahout RecommenderEvaluator interface. Two different implementations of that interface are given: AverageAbsoluteDifferenceRecommenderEvaluator and RMSRecommenderEvaluator. The first one is the average absolute difference between predicted and actual ratings for users and the second one is the classic RMSE (a.k.a. RMSD).

Since I’m playing with a movie dataset and that Netflix evaluation process was based on RMSE, I put here an example of use of the RMSRecommenderEvaluator:

import java.io.File;
import java.io.IOException;
 
import org.apache.commons.cli2.OptionException;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;
 
public final class EvaluationExample{
	public static void main(String... args) throws IOException, TasteException, OptionException {
 
		RecommenderBuilder builder = new RecommenderBuilder() {
			public Recommender buildRecommender(DataModel model) throws TasteException{
				//build here whatever existing or customized recommendation algorithm
				return new CachingRecommender(new SlopeOneRecommender(model));
			}
		};
 
		RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
		DataModel model = new FileDataModel(new File("/home/padjiman/data/movieLens/mahout/ratingsForMahout.dat"));
		double score = evaluator.evaluate(builder,
				null,
				model,
				0.9,
				1);
 
		System.out.println(score);
	}
}

Note that the evaluator need a RecommenderBuilder provided here as an inline implementation of the interface.
For a detailed description of the parameter of the evaluator, look at the javadoc in the sourcecode (as of today, the one that you’ll found on the web is outdated since it concern mahout release 0.1). But basically:
- 0.9 here represents the percentage of each user’s preferences to use to produce recommendations, the rest are compared to estimated preference values to evaluate.
- 1 represent the percentage of users to use in evaluation (so here all users).

Result?

RMSE = 0.8988.
To give you a point of comparison, the Netflix baseline predictor (called Cinematch) had an RMSE of 0.9514 and the Grand Prize was for the team providing an improvement of 10% (not that this tutorial is not based on netflix data but on Movie Lens data).

The number not really matters here: the important thing is that it provide you an easy way to compare different algorithms or the same algorithm with different settings (thresholds or other parameters).

Step 4: Now start the real work…

You guessed that you won’t win any Prize using the recommenders given by Mahout as-is :) .
Depending on your data and on your needs, you may have either to simply customize an existing algorithm or to plug any specific similarity measure or to create your very own recommender from scratch. All of those are pretty easy to do in Mahout.

Let’s say for instance that you want to exploit the category of the movies to build a specific user similarity that includes this information.
What you will have to do is first to be able to capture the new information about categories.

To do so you can for instance extends the FileDataModel class to another class that also parses the movies.dat file and build relevant data structures to store the data about categories. I found more convenient to build my own Statistics object. Then you will have to build a new User similarity. It is as simple as that:

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.PreferenceInferrer;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
 
import com.padjiman.algo.Statistics;
 
public class ProfileSimilarity implements UserSimilarity {
 
	private final Statistics stats;
	private final DataModel dataModel;
 
        public ProfileSimilarity(Statistics stats, DataModel dataModel) {
		super();
		if (stats == null) {
			throw new IllegalArgumentException("stats is null");
		}
		if (dataModel == null) {
		      throw new IllegalArgumentException("dataModel is null");
		    }
		this.dataModel = dataModel;
		this.stats = stats;
	}
 
	@Override
	public double userSimilarity(long userID1, long userID2)
	throws TasteException {
		//build your similarity function here
		//exploiting the stats and dataModel object as you wish
	}
 
	@Override
	public void refresh(Collection alreadyRefreshed) {
		// TODO Auto-generated method stub
	}
 
	@Override
	public void setPreferenceInferrer(PreferenceInferrer inferrer) {
		// TODO Auto-generated method stub
	}
}

Complete the method userSimilarity with you own secret sauce. Et voila: you can now plug your new user similarity measure in a GenericUserBasedRecommender, for instance instead of the Pearson correlation similarity measure (showed in step 2) and simply compare which one is the best using your evaluator.

You’re not satisfied with the GenericUserBasedRecommender or any other recommender provided by Mahout? No problem, try to implement your own. You’ll just have to start with a class declaration of this kind:

public class MostPopularItemUserBasedCombinedRecommender extends AbstractRecommender implements Recommender {
       //override the necessary methods
}

Here again, you can use as member of the class any customized object containing any statistics that you would judge relevant to build a better recommender. Then, again, plug your new recommender in the evaluator and compare, experiment, improve.

Conclusion

Mahout Taste is a very flexible platform to experiment collaborative filtering algorithms. It certainly won’t provide you an immediate solution to your recommendation problem, but you’ll be easily able to either tune the existing algorithms or plug your own creative ones into the mahout taste interfaces set.

By doing so, you’ll immediately get the benefit of a platform allowing you to compare, tune and improve iteratively the results of your different algorithm configurations. Last but not least, Mahout taste provide an external server which exposes recommendation logic to your application via web services and HTTP.

Other ressources:

  • After reading this quick start guide, I recommend you to have a look at the official mahout taste documentation. As of today it is not updated with the release 0.2 so you might find some old method signatures there but you’ll find useful and complementary information about the big picture of Mahout Taste design.
  • A nice article on mahout in general (not only the taste part). I felt that Taste was not enough detailed there, in particular on the testing part, this is why I wrote this tutorial.