Details
-
New Feature
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
-
None
-
New
Description
I think we should graduate the low-level block KD-tree data structure
from sandbox into Lucene's core?
This can be used for very fast 1D range filtering for numerics,
removing the 8 byte (long/double) limit we have today, so e.g. we
could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.
It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
(x/y/z with geo3d) geo shape intersection searches.
The idea here is to add a new part of the Codec API (DimensionalFormat
maybe?) that can do low-level N-dim point indexing and at runtime
exposes only an "intersect" method.
It should give sizable performance gains (smaller index, faster
searching) over what we have today, and even over what auto-prefix
with efficient numeric terms would do.
There are many steps here ... and I think adding this is analogous to
how we added FSTs, where we first added low level data structure
support and then gradually cutover the places that benefit from an
FST.
So for the first step, I'd like to just add the low-level block
KD-tree impl into oal.util.bkd, but make a couple improvements over
what we have now in sandbox:
- Use byte[] as the value not int (@rjernst's good idea!)
- Generalize it to arbitrary dimensions vs. specialized/forked 1D,
2D, 3D cases we have now
This is already hard enough After that we can build the
DimensionalFormat on top, then cutover existing specialized block
KD-trees. We also need to fix OfflineSorter to use Directory API so
we don't fill up /tmp when building a block KD-tree.
A block KD-tree is at heart an inverted data structure, like postings,
but is also similar to auto-prefix in that it "picks" proper
N-dimensional "terms" (leaf blocks) to index based on how the specific
data being indexed is distributed. I think this is a big part of why
it's so fast, i.e. in contrast to today where we statically slice up
the space into the same terms regardless of the data (trie shifting,
morton codes, geohash, hilbert curves, etc.)
I'm marking this as trunk only for now... as we iterate we can see if
it could maybe go back to 5.x...