Writing A Token N-Gram Analyzer In A Few Lines Of Code Using Lucene

If you need to extract the token n-grams of a string, you can use the facilities offered by Lucene analyzers.

All you have to do is build your own analyzer using a ShingleMatrixFilter with the parameters that suit your needs. For instance, here are the few lines of code needed to build a token bi-gram analyzer:

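public class NGramAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize, build 2-token shingles separated by a space, then lowercase.
        // The StopFilter runs last, on whole shingles rather than on single tokens.
        return new StopFilter(
            new LowerCaseFilter(
                new ShingleMatrixFilter(new StandardTokenizer(reader), 2, 2, ' ')),
            StopAnalyzer.ENGLISH_STOP_WORDS);
    }
}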

The numeric parameters of the ShingleMatrixFilter state the minimum and maximum shingle size, and the last parameter is the spacer character placed between the tokens of a shingle. “Shingle” is just another name for a token n-gram; shingles are commonly used as the basic units for problems such as spell checking and near-duplicate detection.
Note also the use of a StandardTokenizer to deal with basic special characters like hyphens or other “disturbers”.
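
To make the minimum and maximum shingle sizes more concrete, here is a minimal plain-Java sketch that builds shingles from an already tokenized list. It does not use Lucene: the class name ShingleSketch and the method shingles are hypothetical, and the order in which shingles are emitted is only illustrative and may differ from what ShingleMatrixFilter produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: illustrates what min/max shingle size mean.
public class ShingleSketch {

    // Returns every run of minSize..maxSize consecutive tokens, joined by the spacer.
    public static List<String> shingles(List<String> tokens, int minSize, int maxSize, char spacer) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder sb = new StringBuilder();
            // Grow the shingle one token at a time, up to maxSize or the end of the list.
            for (int size = 1; size <= maxSize && start + size <= tokens.size(); size++) {
                if (size > 1) {
                    sb.append(spacer);
                }
                sb.append(tokens.get(start + size - 1));
                // Only keep shingles that reach the minimum size.
                if (size >= minSize) {
                    result.add(sb.toString());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("an", "easy", "way", "to");
        // min = max = 2 yields bi-grams only: [an easy, easy way, way to]
        System.out.println(shingles(tokens, 2, 2, ' '));
        // min = 1, max = 3 also yields single tokens and tri-grams.
        System.out.println(shingles(tokens, 1, 3, ' '));
    }
}
```

With both sizes set to 2, as in the analyzer above, only bi-grams come out; widening the range mixes several n-gram sizes in the same stream.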

To use the analyzer, you can for instance write:

	public static void main(String[] args) {
		try {
			String str = "An easy way to write an analyzer for tokens bi-gram (or even tokens n-grams) with lucene";
			Analyzer analyzer = new NGramAnalyzer();
			
			TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
			Token token = new Token();
			while ((token = stream.next(token)) != null){
				System.out.println(token.term());
			}
			
		} catch (IOException ie) {
			System.out.println("IO Error " + ie.getMessage());
		}
	}

This prints:

an easy
easy way
way to
to write
write an
an analyzer
analyzer for
for tokens
tokens bi
bi gram
gram or
or even
even tokens
tokens n
n grams
grams with
with lucene

Note that the text “bi-gram” was split into two separate tokens, which is the desired consequence of feeding the ShingleMatrixFilter with a StandardTokenizer.
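
The splitting of “bi-gram” can be imitated in plain Java. The sketch below is only a rough approximation, assuming that a token is a run of letters or digits and that everything else is a separator; the real StandardTokenizer follows a more elaborate grammar. The class name SimpleTokenizerSketch and the method tokenize are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: a crude stand-in for a word tokenizer.
public class SimpleTokenizerSketch {

    // Lowercases the text and splits it on any run of non-letter, non-digit characters.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            // split() can return a leading empty string when the text starts with a separator.
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [tokens, bi, gram, or, even, tokens, n, grams]
        System.out.println(tokenize("tokens bi-gram (or even tokens n-grams)"));
    }
}
```

Because the hyphen counts as a separator here, “bi” and “gram” become neighbouring tokens, which is why the bi-grams “tokens bi”, “bi gram” and “gram or” appear in the output.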