tech: February 2010

Saturday, February 20, 2010

Interesting performance characteristic in Lucene.Analysis.StopFilter

I was tracking down a slow indexing process and captured this call stack with YourKit:

Collection.addAll() is called on every invocation of tokenStream(), which itself runs many times per document. StopFilter is the culprit. Its constructor looks like this:

  public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase)
  {
    super(input);
    if (stopWords instanceof CharArraySet) {
      this.stopWords = (CharArraySet)stopWords;
    } else {
      this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
      this.stopWords.addAll(stopWords);
    }
    this.enablePositionIncrements = enablePositionIncrements;
    init();
  }

As you can see, if the stopWords set is not an org.apache.lucene.analysis.CharArraySet, the constructor copies the entire set, and it does so every time a StopFilter is created. I was passing a regular Set for stopWords, and that was the problem.
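To make the cost concrete, here is a minimal sketch that mirrors only the instanceof branch of the constructor above. It does not use Lucene at all: FastSet and ModelStopFilter are hypothetical stand-ins for CharArraySet and StopFilter, and the copy counter exists purely for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class StopSetCopyDemo {

    // Hypothetical stand-in for Lucene's CharArraySet; not the real class.
    static final class FastSet extends HashSet<String> {
        FastSet(int initialCapacity) {
            super(initialCapacity);
        }
    }

    // Hypothetical stand-in for StopFilter; models only the constructor's branch.
    static final class ModelStopFilter {
        static int copies = 0; // number of full copies made so far

        final FastSet stopWords;

        ModelStopFilter(Set<String> stopWords) {
            if (stopWords instanceof FastSet) {
                this.stopWords = (FastSet) stopWords; // reuse the caller's set, no copy
            } else {
                this.stopWords = new FastSet(stopWords.size());
                this.stopWords.addAll(stopWords);     // O(n) copy on every construction
                copies++;
            }
        }
    }

    // Constructs `filters` filters from the given set and returns how many copies that caused.
    static int copiesFor(Set<String> stopWords, int filters) {
        ModelStopFilter.copies = 0;
        for (int i = 0; i < filters; i++) {
            new ModelStopFilter(stopWords);
        }
        return ModelStopFilter.copies;
    }

    public static void main(String[] args) {
        Set<String> plain = new HashSet<String>(Arrays.asList("a", "an", "the"));

        FastSet converted = new FastSet(plain.size());
        converted.addAll(plain); // convert once, up front

        System.out.println("plain set copies:     " + copiesFor(plain, 1000));
        System.out.println("converted set copies: " + copiesFor(converted, 1000));
    }
}
```

In this model, 1000 constructions from the plain set trigger 1000 copies, while the pre-converted set triggers none. With a large stop word list and a tokenStream() that runs many times per document, that per-construction copy is the overhead the profiler was showing.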

This was very instructive. I had skipped Lucene's StopFilter.makeStopSet() and used a regular Set because I assumed that Lucene built a regular Set under the covers, and I wanted to save it the trouble since I could generate a Set directly from my input.

Another reason to profile, profile, and profile...

After changing my code to use the Lucene set (by calling StopFilter.makeStopSet()), here is the result, which is clearly faster:

The performance improvement can be significant. In a real-world scenario where this code was used to build a 10 GB index, I saw a 4x speed-up in indexing.