Hadoop Tutorial Series, Issue #3: Counters In Action

Note: This post has been updated with code that works with Hadoop 0.20.1.

In this 3rd issue of the Hadoop tutorial series, we’ll talk about a very simple but very useful Hadoop feature: counters.

Even if you have never defined any counters in Hadoop, you can see some of them each time you run a Hadoop job. Here is what the client console shows at the end of a job’s execution (the same information is available from the web interface):

[Screenshot: Hadoop internal counters at the end of a job]

As you can see, 18 internal counters are presented inside different groups. For instance, there is a section “Job Counters” with three different counters giving basic information about the job, such as the number of mappers and reducers. In that example, “Job Counters” is the group of the counter and “Launched reduce tasks” (for instance) is the name of the counter itself.

It is very handy to define your own counters to track statistics about the records you are manipulating in the mapper and the reducer. The most natural use is to count malformed records.

If you run a job and see an abnormally high number of malformed records, that is a good hint that you have a bug in your code or a problem with your data (this is a much simpler way to spot issues than hunting for error messages across a distributed set of log files). But you can use counters for any other kind of statistics on your records.
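
To make the malformed-record idea concrete, here is a minimal plain-Java sketch. It does not use Hadoop at all: an EnumMap stands in for the counters, and the record format (two comma-separated fields, the second an integer), the RecordQuality enum and the class name are all assumptions made up for this illustration.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for Hadoop counters; no Hadoop classes are used here.
class MalformedRecordSketch {

    // Hypothetical counter names, chosen for this illustration only.
    enum RecordQuality { WELL_FORMED, MALFORMED }

    // Assumed record format: exactly two comma-separated fields, the second an integer.
    static boolean isWellFormed(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 2) {
            return false;
        }
        try {
            Integer.parseInt(fields[1].trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Tallies every record under one of the two counters, as a mapper would do per input line.
    static Map<RecordQuality, Long> countRecords(List<String> lines) {
        Map<RecordQuality, Long> counters = new EnumMap<>(RecordQuality.class);
        for (RecordQuality quality : RecordQuality.values()) {
            counters.put(quality, 0L);
        }
        for (String line : lines) {
            RecordQuality quality = isWellFormed(line)
                    ? RecordQuality.WELL_FORMED
                    : RecordQuality.MALFORMED;
            counters.merge(quality, 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        // Made-up sample records: two valid, two malformed.
        List<String> lines = Arrays.asList("alice,3", "bob,seven", "carol", "dave,12");
        System.out.println(countRecords(lines));
    }
}
```

In a real job the same decision would be made inside the map method, with the tally kept by Hadoop rather than by a local map, so that the totals from all tasks are aggregated for you.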

One easy way to define your own counters from your Java code is:

1. Declare an enum representing your counters. The enum name is the group of the counter, and each constant of the enum is the name of a counter that will be reported in this same group.
2. Increment the desired counters from your map and reduce methods through the Context of your mapper or reducer (in previous Hadoop versions this was done through the Reporter.incrCounter() method, but the new API introduced in Hadoop 0.20 replaces the Reporter with the Context).

So let’s see an example. We’ll take the word count example, revised for version 0.20.1, to illustrate the use of counters. We will create a counter group called WordsNature that counts how many unique tokens there are in all, how many unique tokens start with a digit, and how many unique tokens start with a letter.

So our enum declaration will look like this:

 static enum WordsNature { STARTS_WITH_DIGIT, STARTS_WITH_LETTER, ALL }

We will also need a very basic StringUtils class:

package com.philippeadjiman.hadooptraining;

public class StringUtils {

	public static boolean startsWithDigit(String s){
		if( s == null || s.length() == 0 )
			return false;

		return Character.isDigit(s.charAt(0));
	}

	public static boolean startsWithLetter(String s){
		if( s == null || s.length() == 0 )
			return false;

		return Character.isLetter(s.charAt(0));
	}

}

Since we are interested in unique tokens, we will put the counter code in the reduce method, which is called once per distinct key. So here is what the reduce method looks like:

public void reduce(Text key, Iterable<IntWritable> values, Context context)
	throws IOException, InterruptedException {

	int sum = 0;
	String token = key.toString();
	// reduce() runs once per distinct key, so each increment below counts one unique token
	if( StringUtils.startsWithDigit(token) ){
		context.getCounter(WordsNature.STARTS_WITH_DIGIT).increment(1);
	}
	else if( StringUtils.startsWithLetter(token) ){
		context.getCounter(WordsNature.STARTS_WITH_LETTER).increment(1);
	}
	// counted for every unique token, whatever its first character
	context.getCounter(WordsNature.ALL).increment(1);
	// usual word count logic: sum the occurrences of this token
	for (IntWritable value : values) {
		sum += value.get();
	}
	context.write(key, new IntWritable(sum));
}
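
If you want to check the counting logic without a cluster, the following is a rough local simulation, not Hadoop code. It assumes a TreeMap can play the role of the shuffle (grouping occurrences by token) and an EnumMap the role of the counters; the class and method names are invented for this sketch.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Local simulation only: a TreeMap plays the shuffle, an EnumMap plays the counters.
class UniqueTokenCounterSketch {

    enum WordsNature { STARTS_WITH_DIGIT, STARTS_WITH_LETTER, ALL }

    // "Map + shuffle": split on whitespace and group the occurrences by token.
    static Map<String, Integer> groupTokens(String text) {
        Map<String, Integer> grouped = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            grouped.merge(tokenizer.nextToken(), 1, Integer::sum);
        }
        return grouped;
    }

    // "Reduce": the loop body runs once per distinct key, so each increment counts a unique token.
    static Map<WordsNature, Long> countUniqueTokens(Map<String, Integer> grouped) {
        Map<WordsNature, Long> counters = new EnumMap<>(WordsNature.class);
        for (WordsNature nature : WordsNature.values()) {
            counters.put(nature, 0L);
        }
        for (String token : grouped.keySet()) {
            char first = token.charAt(0); // StringTokenizer never returns empty tokens
            if (Character.isDigit(first)) {
                counters.merge(WordsNature.STARTS_WITH_DIGIT, 1L, Long::sum);
            } else if (Character.isLetter(first)) {
                counters.merge(WordsNature.STARTS_WITH_LETTER, 1L, Long::sum);
            }
            counters.merge(WordsNature.ALL, 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        // Made-up input: "the" and "whale" repeat, yet each is counted once.
        Map<String, Integer> grouped = groupTokens("the whale the sea 1851 the whale");
        System.out.println(countUniqueTokens(grouped));
    }
}
```

Had the increments been placed on the map side instead, every occurrence would be counted rather than every distinct token, which is why the tutorial puts them in reduce.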

Here is the code of the WordCountWithCounter class that includes this method.

If you want to run it inside our learning playground, you’ll just have to update the pom with the latest Hadoop version:

<dependency>
<groupId>org.apache.mahout.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>0.20.1</version>
</dependency>

So here is the result after running the code with the whole text of Moby Dick as input:

[Screenshot: job output, where we can now see our homemade counters]

So we can now see that we have 33783 unique tokens, 32511 starting with a letter and 263 starting with a digit. What about the 1009 others? Well, because the word count example uses a basic StringTokenizer that splits tokens at whitespace, a lot of words simply start with a ‘(’ or a ‘[’, or even with ‘–’. One way to address this is to tokenize with, for instance, Lucene’s StandardAnalyzer.
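
The “others” are easy to reproduce locally. The sketch below is plain Java with a made-up sample line (not taken from the actual input file) and a hypothetical class name; it lists the whitespace-separated tokens whose first character is neither a digit nor a letter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class UncategorizedTokensSketch {

    // Returns the tokens that would only be counted under ALL in the job above.
    static List<String> uncategorized(String text) {
        List<String> others = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text); // default delimiters: whitespace
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            char first = token.charAt(0);
            if (!Character.isDigit(first) && !Character.isLetter(first)) {
                others.add(token);
            }
        }
        return others;
    }

    public static void main(String[] args) {
        // Punctuation stays glued to the words because only whitespace separates tokens.
        System.out.println(uncategorized("(whale) [sic] -and 1851 Ishmael"));
    }
}
```

Running a check like this over a sample of your input is a quick way to decide whether a smarter tokenizer is worth the extra dependency.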

You should now be able to easily implement your own counters for tracking bad records or missing values, debugging, or gathering any kind of statistics about your job.

See you soon for another issue…
