
How To Easily Build And Observe TF-IDF Weight Vectors With Lucene And Mahout

You have a collection of text documents, and you want to build their TF-IDF weight vectors, probably before doing some clustering on the collection or other related tasks.

For instance, you would like to see which tokens have the biggest TF-IDF weights in any given document of the collection.

Lucene and Mahout can help you do that almost in a snap.

Step 1 : Build a Lucene Index out of your document collection

If you don’t know how to build a Lucene index, check the links at the end of the post.

The only two important things in that step are to have a field in your index that can serve as a document id, and to enable term vectors on the text field representing the content of your documents.

So your indexing code should contain at least two lines similar to:

doc.add(new Field("documentId", documentId, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED, TermVector.YES));

Step 2 : Use Mahout lucene.vector driver to generate weighted vectors from your lucene index

That step is well described here. The same page also explains how to generate the vectors directly from a directory of text documents. I used Lucene because my documents were in a data store, and building the Lucene index out of it was just much more flexible and convenient.

You then should end up executing a command similar to:

 ./mahout lucene.vector --dir "myLuceneIndexDirectory" --output "outputVectorPathAndFilename" --dictOut "outputDictionaryPathAndFilename" -f content -i documentId -w TFIDF

Mahout will generate for you:

  • a dictionary of all tokens found in the document collection (tokenized with the Tokenizer you used in step 1, which you might tune depending on your needs)
  • a binary SequenceFile (a class coming from Hadoop) that contains all the TF-IDF weighted vectors

Step 3: Play with the generated vector file

Now, let’s say that, for a given document id, you want to see which tokens received the biggest weights, in order to get a feel for the most significant tokens of that document (as the weighting scheme sees them).

To do so, you can for instance load the content of the generated dictionary file into a Map with token indexes as keys and the tokens as values. Let’s call that map dictionaryMap.
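A minimal sketch of that loading step might look like the following. It assumes the dictionary is a plain-text file where each line carries a token first and its index as the last tab-separated field; check the exact format your Mahout version emits before relying on it.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class DictionaryLoader {

    // Loads token index -> token. Assumes one entry per line, token first,
    // index as the last tab-separated field -- verify against your Mahout
    // version's actual dictionary output format.
    public static Map<Integer, String> load(String path) throws IOException {
        Map<Integer, String> dictionaryMap = new HashMap<Integer, String>();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("#") || line.trim().isEmpty()) {
                    continue; // skip header comments and blank lines
                }
                String[] parts = line.split("\t");
                dictionaryMap.put(Integer.parseInt(parts[parts.length - 1]), parts[0]);
            }
        } finally {
            reader.close();
        }
        return dictionaryMap;
    }
}
```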

Then you’ll have to walk through the generated binary file containing the vectors. By playing a little with the SequenceFile and the Mahout source code, you quickly see which objects you have to manipulate in order to access the vectors’ content in a structured way:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.math.VectorWritable;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
String vectorsPath = args[1];
Path path = new Path(vectorsPath);

SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
LongWritable key = new LongWritable();
VectorWritable value = new VectorWritable();
while (, value)) {
    NamedVector namedVector = (NamedVector) value.get();
    RandomAccessSparseVector vect = (RandomAccessSparseVector) namedVector.getDelegate();

    for (Element e : vect) {
        System.out.println("Token: " + dictionaryMap.get(e.index())
            + ", TF-IDF weight: " + e.get());
    }
}
reader.close();

The important things to notice in that code are the following:

  • namedVector.getName() contains the documentId
  • e.index() contains the index of the token as it appears in the dictionary output file, so you can look the token itself up in dictionaryMap
  • e.get() contains the weight itself

From there you’ll easily be able to plug in your own code to do whatever you want with the tokens and their weights, like printing the tokens with the biggest weights in a given document.
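For instance, once you have collected one document’s tokens and weights into a map inside the loop above (the names below are hypothetical, not part of Mahout), a small helper can sort them and keep the top N:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class TopTokens {

    // Returns the n entries with the highest weights, biggest first.
    public static List<Map.Entry<String, Double>> top(Map<String, Double> weights, int n) {
        List<Map.Entry<String, Double>> entries =
            new ArrayList<Map.Entry<String, Double>>(weights.entrySet());
        // Sort by weight, descending.
        Collections.sort(entries, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
                return, a.getValue());
            }
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }
}
```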

It can also be insightful for tuning your weighting model. For example, you’ll quickly observe that typos often get a very high weight, which makes sense in the TF-IDF weighting scheme (unless the typo is very frequent in your document collection), and you might want to fix that.
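One crude mitigation, sketched below, is to drop tokens that appear in fewer than a minimum number of documents before looking at the weights. The docFreq map here is hypothetical — you would have to build it yourself, e.g. from the document frequencies in your index:

```java
import java.util.HashMap;
import java.util.Map;

public class RareTokenFilter {

    // Drops tokens whose document frequency is below minDf -- a crude way
    // to keep one-off typos from dominating the top TF-IDF weights.
    public static Map<String, Double> filter(Map<String, Double> weights,
                                             Map<String, Integer> docFreq, int minDf) {
        Map<String, Double> kept = new HashMap<String, Double>();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            Integer df = docFreq.get(e.getKey());
            if (df != null && df >= minDf) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```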

It is also useful just to understand a little bit more about how Mahout represents the data internally.

Useful links: