Profiling showed a Collection.addAll() call on every invocation of tokenStream(), which is called many times per document. The StopFilter was the culprit. When I checked its constructor, it looked like this:
public StopFilter(boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase)
{
    super(input);
    if (stopWords instanceof CharArraySet) {
        this.stopWords = (CharArraySet) stopWords;
    } else {
        this.stopWords = new CharArraySet(stopWords.size(), ignoreCase);
        this.stopWords.addAll(stopWords);
    }
    this.enablePositionIncrements = enablePositionIncrements;
    init();
}
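The branch above can be boiled down to a small self-contained sketch (the class names here are illustrative stand-ins, not Lucene's actual classes): a constructor that reuses an argument already of its preferred set type, but makes a full defensive copy of anything else. Constructed once per tokenStream() call, the copy cost multiplies across every document.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative stand-in for CharArraySet; not Lucene's real class.
class FastStopSet extends HashSet<String> {
    static int copiesMade = 0; // instrumentation: counts full copies

    FastStopSet() {}

    FastStopSet(Set<String> words) {
        super(words); // element-by-element copy, like CharArraySet.addAll
        copiesMade++;
    }
}

// Mirrors the StopFilter constructor's branch: reuse the set if it is
// already the preferred type, otherwise copy it.
class FilterSketch {
    final FastStopSet stopWords;

    FilterSketch(Set<String> stopWords) {
        if (stopWords instanceof FastStopSet) {
            this.stopWords = (FastStopSet) stopWords;   // no copy
        } else {
            this.stopWords = new FastStopSet(stopWords); // copy on every call
        }
    }
}

public class StopFilterCopyDemo {
    public static void main(String[] args) {
        Set<String> plain = new HashSet<>();
        plain.add("the"); plain.add("a"); plain.add("of");

        // Passing an ordinary Set: one full copy per filter construction.
        for (int i = 0; i < 1000; i++) new FilterSketch(plain);
        System.out.println("copies with plain Set: " + FastStopSet.copiesMade);

        // Pre-building the specialized set: one copy up front, then none.
        FastStopSet prebuilt = new FastStopSet(plain);
        FastStopSet.copiesMade = 0;
        for (int i = 0; i < 1000; i++) new FilterSketch(prebuilt);
        System.out.println("copies with prebuilt set: " + FastStopSet.copiesMade);
    }
}
```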
As you can see, if the stopWords set is not of type org.apache.lucene.analysis.CharArraySet, a full copy is made. I was passing an ordinary java.util Set, so I paid for that copy on every call.
This was very instructive. The reason I hadn't used Lucene's StopFilter.makeStopSet() and had used a regular Set instead was that I assumed Lucene built an ordinary Set under the covers, and I wanted to save it the trouble, since I could generate a Set directly from my input.
Another reason to profile, profile and profile....
After changing my code to use the Lucene set (built by calling StopFilter.makeStopSet()), the result was clearly faster:
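The fix is simply to pay the conversion cost once, up front, and hand the same instance to every tokenStream() call. Here is a minimal self-contained sketch of that pattern (class and method names are illustrative; in the real code the one-time build is StopFilter.makeStopSet(), which returns a CharArraySet):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of an analyzer that builds its stop set once, in the constructor,
// instead of letting the filter copy a plain Set per tokenStream() call.
class CachedStopAnalyzer {
    static int setBuilds = 0; // counts expensive set constructions

    private final Set<String> stopSet; // built once, reused for every document

    CachedStopAnalyzer(String... stopWords) {
        this.stopSet = buildStopSet(stopWords); // paid once, not per document
    }

    private static Set<String> buildStopSet(String... words) {
        setBuilds++; // stands in for the CharArraySet conversion cost
        return new HashSet<>(Arrays.asList(words));
    }

    // Stands in for the per-document tokenStream() path: the stop set is
    // consulted, never rebuilt.
    boolean isStopWord(String token) {
        return stopSet.contains(token.toLowerCase());
    }
}

public class CachedStopDemo {
    public static void main(String[] args) {
        CachedStopAnalyzer analyzer = new CachedStopAnalyzer("the", "a", "of");
        // Simulate many documents; the stop set is built exactly once.
        for (int i = 0; i < 10_000; i++) analyzer.isStopWord("the");
        System.out.println("set builds: " + CachedStopAnalyzer.setBuilds);
    }
}
```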
The performance improvement can be significant: in a real-world scenario where this code was used to build a 10 GB index, I saw a 4x speed-up in indexing.