Phew, that was fast!
Wow, you nuked the terms dict cache. Nice!
Though it makes me a bit nervous... like there'll always be a risk
we've missed some path through Lucene that does two lookups... And,
even for external reasons (eg the same query arrives at Lucene again,
looking for the next page or something), the cache is useful.
EG, a straight TermQuery (not spawned by MTQ) is now hitting the terms
dict twice. Once inside Sim.idfExplain, where it calls
searcher.docFreq(term), and then again to pull the scorers per sub
reader. Probably, TermQuery should pull the PerReaderTermState, up
front, if it wasn't already handed it? And then pass the docFreq to
Should we add a PerReaderTermState.docFreq(), which just sums up
across all subs?
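To make the docFreq() idea concrete, here is a rough sketch. The class
and method names below (SubReaderTermState, PerReaderTermStateSketch,
register, get) are made up for illustration and are not the classes in
the patch; the only point is that the top-level docFreq is the sum of
the per-sub values, so the query can hand it to the similarity without
a second terms dict lookup.

```java
// Hypothetical stand-in for the per-segment term state; not the real Lucene class.
final class SubReaderTermState {
    final int docFreq;

    SubReaderTermState(int docFreq) {
        this.docFreq = docFreq;
    }
}

// Hypothetical holder with one slot per sub reader, filled in during the single lookup pass.
final class PerReaderTermStateSketch {
    // A null slot means the term does not occur in that sub reader.
    private final SubReaderTermState[] states;

    PerReaderTermStateSketch(int numSubReaders) {
        states = new SubReaderTermState[numSubReaders];
    }

    void register(int subOrd, SubReaderTermState state) {
        states[subOrd] = state;
    }

    SubReaderTermState get(int subOrd) {
        return states[subOrd];
    }

    // Sums docFreq across all subs, skipping subs where the term is absent.
    int docFreq() {
        int sum = 0;
        for (SubReaderTermState s : states) {
            if (s != null) {
                sum += s.docFreq;
            }
        }
        return sum;
    }
}
```

With something like this, TermQuery could build the state once up front,
read docFreq() from it for the idf, and then reuse get(subOrd) when it
pulls the scorer for each sub reader.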
Does TermState really need field()? Seems wasteful to have to store
that... eg an MTQ will store many TermStates against the same field.
I think we should keep TermState lean.
Also, I think it shouldn't need that clone method?
I think instead of duplicating docs/docsAndPositions (and soon
bulkPostings) on TermsEnum, once for TermState and once without, we
should just add a seek(TermState)? And then the single
docs/docsAndPositions/etc. method can be used to get the enum for that
term. (Likewise for Terms.) Also, we should remove docFreq and ord
from TermsEnum since you should get them from TermState?
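Roughly what I have in mind, as a toy in-memory sketch. Everything here
(LeanTermState, SketchTermsEnum, the lookups counter, docs() returning
an int[]) is invented for illustration and is much simpler than the real
TermsEnum API; it only shows a single docs() method serving both the
seek-by-text and the seek-by-state paths, with docFreq and ord living on
the state.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical lean state: no field, only what is needed to reposition the enum.
final class LeanTermState {
    final int ord;
    final int docFreq;

    LeanTermState(int ord, int docFreq) {
        this.ord = ord;
        this.docFreq = docFreq;
    }
}

// Hypothetical enum over a sorted in-memory terms dict for one field.
final class SketchTermsEnum {
    private final List<String> terms = new ArrayList<>();
    private final List<int[]> postings = new ArrayList<>();
    private int current = -1;

    // Counts seeks by term text, i.e. the lookups we are trying to avoid repeating.
    int lookups = 0;

    SketchTermsEnum(TreeMap<String, int[]> dict) {
        // TreeMap iterates in sorted key order, so terms stays sorted.
        for (Map.Entry<String, int[]> e : dict.entrySet()) {
            terms.add(e.getKey());
            postings.add(e.getValue());
        }
    }

    // Seek by term text: this is the terms dict lookup.
    boolean seek(String term) {
        lookups++;
        int idx = Collections.binarySearch(terms, term);
        current = idx >= 0 ? idx : -1;
        return idx >= 0;
    }

    // Capture the state of the current term so it can be revisited later.
    LeanTermState termState() {
        ensurePositioned();
        return new LeanTermState(current, postings.get(current).length);
    }

    // Seek by previously captured state: no terms dict lookup.
    void seek(LeanTermState state) {
        current = state.ord;
    }

    // Single docs method, used no matter how the enum was positioned.
    int[] docs() {
        ensurePositioned();
        return postings.get(current);
    }

    private void ensurePositioned() {
        if (current < 0) {
            throw new IllegalStateException("enum is not positioned on a term");
        }
    }
}
```

The caller seeks by text once, keeps the state, and later does
seek(state) followed by docs(), so there is no second variant of docs()
that takes a TermState.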
I think IndexReader can offer the sugar methods (that take either
BytesRef term or String field + TermState state).
Also: I tried to run the benchmark on beast but unfortunately there's
a bug somewhere (even though Lucene core tests pass) – I see
different results for some fuzzy queries.
Nice work!! Getting to single term lookup for all queries will be awesome!