Attaching optimization patch. Results up front:
random seeks to common terms with term enumerator: 58% improvement
full iteration over all docs matching relatively unique terms: 1595% improvement
- MultiTermEnum keeps track of which segments match the current term. If termDocs.seek(termEnum) is used, MultiTermDocs visits only the segments that matched the term.
- MultiTermEnum defers calling next() on the sub-enumerators until needed. Because each sub-enumerator is still positioned on the correct term, MultiTermDocs can use the faster seek(enum). Deferral also avoids next() calls whose results may never be used, as well as unnecessary insertions into the priority queue. Using seek(enum) in the sub TermDocs also lets these optimizations cascade (in the event that one has a MultiReader of MultiReaders).
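
To illustrate the two points above, here is a minimal, self-contained sketch of the idea. It is not the code from the patch and does not use the Lucene API: SegmentEnum and MergedEnum are hypothetical stand-ins for a per-segment TermEnum and for MultiTermEnum, and terms are plain strings. It only shows the bookkeeping: remember which segments matched the current term, and advance those segments lazily on the following call to next().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class DeferredMergeSketch {

    /** Hypothetical stand-in for one segment's sorted term enumerator. */
    static final class SegmentEnum {
        final int segment;
        final List<String> terms;
        int pos = 0;        // starts positioned on the first term
        int nextCalls = 0;  // counts how often next() actually ran

        SegmentEnum(int segment, String... sortedTerms) {
            this.segment = segment;
            this.terms = Arrays.asList(sortedTerms);
        }

        String term() { return pos < terms.size() ? terms.get(pos) : null; }

        boolean next() { nextCalls++; pos++; return pos < terms.size(); }
    }

    /** Hypothetical stand-in for the merged enumerator over all segments. */
    static final class MergedEnum {
        // Ordered by term, ties broken by segment number.
        private final PriorityQueue<SegmentEnum> queue = new PriorityQueue<>((a, b) -> {
            int c = a.term().compareTo(b.term());
            return c != 0 ? c : Integer.compare(a.segment, b.segment);
        });
        // Segments whose enumerator is positioned on the current term.
        private final List<SegmentEnum> matching = new ArrayList<>();
        private String current;

        MergedEnum(List<SegmentEnum> subs) {
            for (SegmentEnum s : subs) {
                if (s.term() != null) queue.add(s);
            }
        }

        boolean next() {
            // Deferred work: only now advance the segments that matched the
            // previous term, and re-queue those that still have terms left.
            for (SegmentEnum s : matching) {
                if (s.next()) queue.add(s);
            }
            matching.clear();
            if (queue.isEmpty()) { current = null; return false; }
            current = queue.peek().term();
            // Pull every segment positioned on the new current term, but do
            // NOT advance it yet: a consumer can still reuse its position.
            while (!queue.isEmpty() && queue.peek().term().equals(current)) {
                matching.add(queue.poll());
            }
            return true;
        }

        String term() { return current; }

        /** Segment numbers a consumer would need to visit for the current term. */
        List<Integer> matchingSegments() {
            List<Integer> out = new ArrayList<>();
            for (SegmentEnum s : matching) out.add(s.segment);
            return out;
        }
    }

    public static void main(String[] args) {
        MergedEnum merged = new MergedEnum(Arrays.asList(
                new SegmentEnum(0, "apple", "cat"),
                new SegmentEnum(1, "bird", "cat"),
                new SegmentEnum(2, "cat", "dog")));
        while (merged.next()) {
            System.out.println(merged.term() + " -> " + merged.matchingSegments());
        }
    }
}
```

In this sketch, a consumer that receives the merged enumerator can skip every segment not listed by matchingSegments(), and the listed sub-enumerators are still sitting on the current term, which is the property that makes a seek(enum)-style shortcut possible.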
Test index: this was obviously stacked to show best-case performance for these optimizations. 999,999 documents with maxBufferedDocs=10, resulting in 46 segments. The full iteration test used relatively unique terms (1 or 2 docs matching each), and the random seeks test used very common terms. If rare terms are used in the random seeks test, the initial seek dominates and swamps any improvement from deferring the calls to next().