Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-6825

Add multidimensional byte[] indexing support to Lucene

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 6.0
    • None
    • None
    • New

    Description

      I think we should graduate the low-level block KD-tree data structure
      from sandbox into Lucene's core?

      This can be used for very fast 1D range filtering for numerics,
      removing the 8 byte (long/double) limit we have today, so e.g. we
      could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.

      It can also be used for > 1D use cases, like 2D (lat/lon) and 3D
      (x/y/z with geo3d) geo shape intersection searches.

      The idea here is to add a new part of the Codec API (DimensionalFormat
      maybe?) that can do low-level N-dim point indexing and at runtime
      exposes only an "intersect" method.

      It should give sizable performance gains (smaller index, faster
      searching) over what we have today, and even over what auto-prefix
      with efficient numeric terms would do.

      There are many steps here ... and I think adding this is analogous to
      how we added FSTs, where we first added low level data structure
      support and then gradually cutover the places that benefit from an
      FST.

      So for the first step, I'd like to just add the low-level block
      KD-tree impl into oal.util.bkd, but make a couple improvements over
      what we have now in sandbox:

      • Use byte[] as the value not int (@rjernst's good idea!)
      • Generalize it to arbitrary dimensions vs. specialized/forked 1D,
        2D, 3D cases we have now

      This is already hard enough After that we can build the
      DimensionalFormat on top, then cutover existing specialized block
      KD-trees. We also need to fix OfflineSorter to use Directory API so
      we don't fill up /tmp when building a block KD-tree.

      A block KD-tree is at heart an inverted data structure, like postings,
      but is also similar to auto-prefix in that it "picks" proper
      N-dimensional "terms" (leaf blocks) to index based on how the specific
      data being indexed is distributed. I think this is a big part of why
      it's so fast, i.e. in contrast to today where we statically slice up
      the space into the same terms regardless of the data (trie shifting,
      morton codes, geohash, hilbert curves, etc.)

      I'm marking this as trunk only for now... as we iterate we can see if
      it could maybe go back to 5.x...

      Attachments

        1. LUCENE-6825.patch
          102 kB
          Michael McCandless
        2. LUCENE-6825.patch
          72 kB
          Michael McCandless

        Activity

          People

            mikemccand Michael McCandless
            mikemccand Michael McCandless
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: