Details
Task
Status: Open
Minor
Resolution: Unresolved
3.0
New
Description
CharArraySet should be abstract.
The hashing implementation currently in use should instead be called CharArrayHashSet.
Currently our 'CharArrayHashSet' is hardcoded across Lucene, but others might want their own implementation.
For example, implementing CharArraySet as a DFA with org.apache.lucene.util.automaton gives faster contains(char[], int, int) performance, as it can 'fast fail' on the first non-matching character and need not hash the entire string.
This is useful as it speeds up indexing in StopFilter.
I did not think this would be faster, but I ran benchmarks over and over with the Reuters corpus, and it is, even with English text's weird average word length of 5.
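To make the proposal concrete, here is a rough sketch of the split. It is not the actual patch. All class names (CharArraySetSketch, CharArrayHashSetSketch, CharArrayDfaSetSketch) are hypothetical, and the DFA variant is a hand-rolled, ASCII-only transition table rather than anything built on org.apache.lucene.util.automaton. The hash variant is also simplified: it allocates a String per lookup, which the real set avoids. It only illustrates why a DFA can reject a term after reading one or two characters while a hash lookup has to read all of them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical abstract base: only the lookup contract is fixed.
abstract class CharArraySetSketch {
    abstract boolean contains(char[] text, int off, int len);
}

// Hash-based variant (simplified): every lookup reads all len chars to compute the hash.
final class CharArrayHashSetSketch extends CharArraySetSketch {
    private final Set<String> words = new HashSet<>();

    CharArrayHashSetSketch(List<String> entries) {
        words.addAll(entries);
    }

    @Override
    boolean contains(char[] text, int off, int len) {
        return words.contains(new String(text, off, len));
    }
}

// DFA-based variant: a trie stored as a dense transition table.
// Assumption for this sketch: entries are ASCII only.
final class CharArrayDfaSetSketch extends CharArraySetSketch {
    private static final int ALPHABET = 128;
    private final List<int[]> transitions = new ArrayList<>();
    private final BitSet accepting = new BitSet();

    CharArrayDfaSetSketch(List<String> entries) {
        newState(); // state 0 is the start state
        for (String entry : entries) {
            int state = 0;
            for (int i = 0; i < entry.length(); i++) {
                char c = entry.charAt(i);
                if (c >= ALPHABET) {
                    throw new IllegalArgumentException("ASCII only in this sketch: " + entry);
                }
                int next = transitions.get(state)[c];
                if (next == -1) {
                    next = newState();
                    transitions.get(state)[c] = next;
                }
                state = next;
            }
            accepting.set(state);
        }
    }

    // Adds a state with no outgoing transitions and returns its index.
    private int newState() {
        int[] row = new int[ALPHABET];
        Arrays.fill(row, -1);
        transitions.add(row);
        return transitions.size() - 1;
    }

    @Override
    boolean contains(char[] text, int off, int len) {
        int state = 0;
        for (int i = off; i < off + len; i++) {
            char c = text[i];
            if (c >= ALPHABET) {
                return false;
            }
            state = transitions.get(state)[c];
            if (state == -1) {
                return false; // fast fail: no entry continues with this character
            }
        }
        return accepting.get(state);
    }
}
```

With a stop set of "the", "and", "of", a lookup of "lucene" in the DFA variant returns after the first character, since no entry starts with 'l'. A caller such as StopFilter would only depend on the abstract type, so either implementation could be plugged in.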