So for the whole segment, does it decide to insert auto-prefixes at specific byte lengths (e.g. 3, 5, and 7)? Or does it vary based on specific terms? I'm hoping it's smart enough to vary based on specific terms. For example, if hypothetically there were lots of terms that had the common prefix "BGA", then it might decide "BGA" makes a good auto-prefix, without necessarily creating one for every term at length 3, since many others might not make good prefixes. Make sense?
It's dynamic, based on how terms occupy the space.
Today (and we can change this: it's an impl. detail) it assigns
prefixes just like it assigns terms to blocks. I.e., when it sees a
given prefix matches 25 - 48 terms, it inserts a new auto-prefix
term. That auto-prefix term replaces those 25 - 48 other terms with 1
term, and then the process "recurses", i.e. you can then have a
shorter auto-prefix term matching 25 - 48 other normal and/or
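
To make the assignment rule above concrete, here is a minimal Java sketch.
It is not the patch's code: the names (AutoPrefixSketch, assignPrefixes) are
made up for illustration, it works on Java chars rather than term bytes, and
it simply skips a prefix whose entry count exceeds the upper bound instead of
splitting it. It only shows the idea of replacing 25 - 48 matching entries
with one auto-prefix term and then letting a shorter prefix count that term
as a single entry.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; not Lucene's block-tree implementation.
class AutoPrefixSketch {

    /** Returns the auto-prefix terms chosen for the given terms (longest prefixes first within a subtree). */
    static List<String> assignPrefixes(Collection<String> terms, int min, int max) {
        List<String> sorted = new ArrayList<>(new TreeSet<>(terms)); // sort and de-duplicate
        List<String> prefixes = new ArrayList<>();
        visit(sorted, 0, sorted.size(), 0, min, max, prefixes);
        return prefixes;
    }

    /**
     * Visits terms[from, to), which all share their first `depth` chars.
     * Returns how many entries this range contributes to the next shorter prefix:
     * 1 if it was replaced by an auto-prefix term, otherwise its entry count.
     */
    private static int visit(List<String> terms, int from, int to, int depth,
                             int min, int max, List<String> out) {
        int entries = 0;
        int i = from;
        while (i < to) {
            String term = terms.get(i);
            if (term.length() == depth) { // the term equal to the prefix itself
                entries++;
                i++;
                continue;
            }
            // Group the terms that share the next char and recurse into the longer prefix.
            char c = term.charAt(depth);
            int j = i + 1;
            while (j < to && terms.get(j).charAt(depth) == c) {
                j++;
            }
            entries += visit(terms, i, j, depth + 1, min, max, out);
            i = j;
        }
        if (depth > 0 && entries >= min && entries <= max) {
            out.add(terms.get(from).substring(0, depth)); // one auto-prefix term replaces them
            return 1;
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> terms = new ArrayList<>();
        for (int n = 0; n < 30; n++) {
            terms.add(String.format("BGA%02d", n));
        }
        terms.add("BXY");
        terms.add("CAT");
        System.out.println(assignPrefixes(terms, 25, 48)); // prints [BGA]
    }
}
```

In this toy run only "BGA" becomes an auto-prefix term; "BG", "B" and "C" do
not, because after the replacement each of them covers far fewer than 25
entries.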
At a low level, do I take advantage of this the same way I would at a high level, i.e. by using PrefixQuery, getting the Weight, and then getting the Scorer to iterate docIDs? Or is there a lower-level path? Although there is some elegance to not introducing new APIs, I think it's worth exploring having prefix & range capabilities be on the TermsEnum in some way.
What this patch does is generalize/relax Terms.intersect: that method
no longer guarantees that you see "true" terms. Rather, it now
guarantees only that the docIDs you visit will be the same.
So to take advantage of it, you need to pass an Automaton to
Terms.intersect and then not care about which terms you see, only the
docIDs after visiting the DocsEnum across all terms it returned to
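
The relaxed contract can be illustrated with a toy in-memory index. This is
only a sketch under stated assumptions: it does not call Lucene at all, and
every name in it (RelaxedIntersectSketch, addTerm, addAutoPrefix, collect) is
hypothetical. The point it tries to show is that the terms a caller sees may
differ (real terms vs. one auto-prefix term) while the docIDs collected are
the same.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy index; not Lucene's Terms.intersect implementation.
class RelaxedIntersectSketch {
    private final TreeMap<String, int[]> realTerms = new TreeMap<>();
    private final TreeMap<String, int[]> autoPrefixTerms = new TreeMap<>();

    void addTerm(String term, int... docIds) {
        realTerms.put(term, docIds);
    }

    /** "Index time": store the union of docIDs of all real terms under the prefix. */
    void addAutoPrefix(String prefix) {
        TreeSet<Integer> union = collectFromRealTerms(prefix, null);
        autoPrefixTerms.put(prefix, union.stream().mapToInt(Integer::intValue).toArray());
    }

    /** "Search time": records the terms visited and returns the matching docIDs. */
    TreeSet<Integer> collect(String prefix, List<String> visitedTerms) {
        int[] precomputed = autoPrefixTerms.get(prefix);
        if (precomputed == null) {
            return collectFromRealTerms(prefix, visitedTerms);
        }
        visitedTerms.add(prefix + "*"); // a single auto-prefix term, not a "true" term
        TreeSet<Integer> docs = new TreeSet<>();
        for (int doc : precomputed) {
            docs.add(doc);
        }
        return docs;
    }

    private TreeSet<Integer> collectFromRealTerms(String prefix, List<String> visitedTerms) {
        TreeSet<Integer> docs = new TreeSet<>();
        // Terms sharing the prefix are contiguous in sorted order, starting at the prefix.
        for (Map.Entry<String, int[]> e : realTerms.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            if (visitedTerms != null) {
                visitedTerms.add(e.getKey());
            }
            for (int doc : e.getValue()) {
                docs.add(doc);
            }
        }
        return docs;
    }
}
```

A caller that only consumes the returned docIDs cannot tell which path was
taken; a caller that inspects visitedTerms can, which is exactly what the
relaxed guarantee no longer promises to keep stable.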
Do you envision other posting formats being able to re-use the logic here? That would be nice.
I agree it would be nice ... and the index-time logic that identifies
the auto-prefix terms is quite standalone, so e.g. we could pull it
out and have it "wrap" the incoming Fields to insert the auto-prefix
terms. This way it's nearly transparent to any postings format ...
But the problem is, at search time, there's tricky logic in
intersect() to use these prefix terms ... factoring that out so other
formats can use it is trickier I think... though maybe we could fold
it into the default Terms.intersect() impl...
In your future tuning, I suggest you add the ability to vary how conservative or aggressive the prefixing is at the very beginning and very end of the term (assuming known common lengths). In the FlexPrefixTree that Varun (GSoC) worked on, the number of leaves per level is configurable at each level (i.e. prefix length), and it's better to have little prefixing at the very top and little at the bottom too. At the top, prefixes only help for queries that span massive portions of the possible term space (which is rare in spatial, and likely in other apps too). And at the bottom (long prefixes just shy of the maximum length, say 7 bytes out of 8 for a double) there is marginal value, because in the spatial search algorithm the bottom detail is scanned over (e.g. with TermsEnum.next()) instead of being reached by a seek, since the data is less dense and it's adjacent. This principle may apply to numeric-range queries depending on how they are coded; I'm not sure.
I agree this (how auto-prefix terms are assigned) needs more control /
experimenting. Really the best prefixes are a function not only of
how the terms were distributed, but also of how queries will "likely"
ask for ranges.
I think it's similar to postings skip lists, where we have different
frequency of a skip pointer on the "leaf" level vs the "upper" skip