Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-6881

Cutover all BKD tree implementations to the codec

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is phase 4 for enabling indexing dimensional values in Lucene
      ... follow-on from LUCENE-6861.

      This issue removes the 3 pre-existing specialized experimental BKD
      implementations (BKD* in sandbox module for 2D lat/lon geo, BKD3D* in
      spatial3d module for 3D x/y/z geo, and range tree in sandbox module)
      and instead switches over to having the codec index the dimensional
      values.

      1. LUCENE-6881.patch
        499 kB
        Michael McCandless
      2. LUCENE-6881.patch
        518 kB
        Michael McCandless

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Initial patch, plenty of nocommits too, and some tests fail, but I think it's close ...

        Show
        mikemccand Michael McCandless added a comment - Initial patch, plenty of nocommits too, and some tests fail, but I think it's close ...
        Hide
        mikemccand Michael McCandless added a comment -

        New patch, fixing nocommits and improving tests and passing precommit.

        I think it's ready ...

        Show
        mikemccand Michael McCandless added a comment - New patch, fixing nocommits and improving tests and passing precommit. I think it's ready ...
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 1713278 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1713278 ]

        LUCENE-6881: cutover all BKD implementations to the codec's implementation

        Show
        jira-bot ASF subversion and git services added a comment - Commit 1713278 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1713278 ] LUCENE-6881 : cutover all BKD implementations to the codec's implementation
        Hide
        mikemccand Michael McCandless added a comment -

        Fixed (6.0 only) ... I'll post some performance results here soon ...

        Show
        mikemccand Michael McCandless added a comment - Fixed (6.0 only) ... I'll post some performance results here soon ...
        Hide
        mikemccand Michael McCandless added a comment -

        I re-ran the "lat/lon points in rects around London, UK" perf test from luceneutil (IndexOSM*.java and SearchOSM*.java sources).

        This test indexes 60.8 M lat/lon points derived from Open Street Maps data and then runs varying regularly spaced rectangles (225 queries in all) around London, UK.

        I used SMS and LogDocsMP to get to a 5/5/5 segment structure for all three tests, and so only a single thread is used throughout for fair comparison of indexing times:

        Spatial module, using RecursivePrefixTreeStrategy with PackedQuadPrefixTree at 25 levels:

        • 1,464 sec to index
        • 7.8 GB index on disk
        • 239 MB in-heap (ramBytesUsed summed across all segments)
        • 3.98 sec to run 225 searches (best of 100 iters)

        GeoPointField (sandbox)

        • 497 sec to index
        • 3.2 GB index on disk
        • 86 MB heap (ramBytesUsed summed across all segments)
        • 4.48 sec to run 225 searches (best of 100 iters)

        Dimensional values (this patch) using default codec's dimensional format

        • 744 sec to index
        • 704 MB index on disk
        • 2.3 MB heap (ramBytesUsed summed across all segments)
        • 2.85 sec to run 225 searches (best of 100 iters)

        The spatial module is purely postings, geo point field is postings + doc values, and dimensional values is the new BKD tree.

        Net/net indexing time for dimensional values approach is in between geo point field and spatial, but the resulting index as well as heap required at search time is much smaller, and the searching is faster.

        The search time for dimensional values is a bit slower than the specialized (to lat/lon) doc-values based BKD from LUCENE-6477 / LUCENE-6645 (2.32 sec to run 225 searches) but I think we can optimize things later.

        I haven't tested the 1D case, and I suspect there are important specializations we can make there, but I'll save that for a follow-on.

        Show
        mikemccand Michael McCandless added a comment - I re-ran the "lat/lon points in rects around London, UK" perf test from luceneutil ( IndexOSM*.java and SearchOSM*.java sources). This test indexes 60.8 M lat/lon points derived from Open Street Maps data and then runs varying regularly spaced rectangles (225 queries in all) around London, UK. I used SMS and LogDocsMP to get to a 5/5/5 segment structure for all three tests, and so only a single thread is used throughout for fair comparison of indexing times: Spatial module, using RecursivePrefixTreeStrategy with PackedQuadPrefixTree at 25 levels: 1,464 sec to index 7.8 GB index on disk 239 MB in-heap (ramBytesUsed summed across all segments) 3.98 sec to run 225 searches (best of 100 iters) GeoPointField (sandbox) 497 sec to index 3.2 GB index on disk 86 MB heap (ramBytesUsed summed across all segments) 4.48 sec to run 225 searches (best of 100 iters) Dimensional values (this patch) using default codec's dimensional format 744 sec to index 704 MB index on disk 2.3 MB heap (ramBytesUsed summed across all segments) 2.85 sec to run 225 searches (best of 100 iters) The spatial module is purely postings, geo point field is postings + doc values, and dimensional values is the new BKD tree. Net/net indexing time for dimensional values approach is in between geo point field and spatial, but the resulting index as well as heap required at search time is much smaller, and the searching is faster. The search time for dimensional values is a bit slower than the specialized (to lat/lon) doc-values based BKD from LUCENE-6477 / LUCENE-6645 (2.32 sec to run 225 searches) but I think we can optimize things later. I haven't tested the 1D case, and I suspect there are important specializations we can make there, but I'll save that for a follow-on.
        Hide
        dsmiley David Smiley added a comment -

        These are incredible numbers Michael McCandless; bravo!

        Curious; do you imagine that BKD might one day be modified to differentiate the indexed cells by two types – what I call a "leaf" and a non-leaf"? So in this way, a document could have BKD cells that are leaves and non-leaves which refer to indexing non-point shapes in which the inner cells entirely contained by some indexed shape have leaf cells and the non-leaf cells are those at the edge?

        Show
        dsmiley David Smiley added a comment - These are incredible numbers Michael McCandless ; bravo! Curious; do you imagine that BKD might one day be modified to differentiate the indexed cells by two types – what I call a "leaf" and a non-leaf"? So in this way, a document could have BKD cells that are leaves and non-leaves which refer to indexing non-point shapes in which the inner cells entirely contained by some indexed shape have leaf cells and the non-leaf cells are those at the edge?
        Hide
        mikemccand Michael McCandless added a comment -

        Curious; do you imagine that BKD might one day be modified to differentiate the indexed cells by two types

        I do think that's a natural (future!) generalization for DimensionalFormat, so we can index shapes.

        It's useful in the 1D case as well: it would let us index numeric ranges, or even unions of disjoint ranges, in a single document/field.

        But I think we should postpone this for now ... it would add quite a bit of complexity, and we need to see how dimensional values are used / hardened first.

        Show
        mikemccand Michael McCandless added a comment - Curious; do you imagine that BKD might one day be modified to differentiate the indexed cells by two types I do think that's a natural (future!) generalization for DimensionalFormat , so we can index shapes. It's useful in the 1D case as well: it would let us index numeric ranges, or even unions of disjoint ranges, in a single document/field. But I think we should postpone this for now ... it would add quite a bit of complexity, and we need to see how dimensional values are used / hardened first.
        Hide
        mikemccand Michael McCandless added a comment -

        I tested the 1D case as well, just indexing "lat" from the London UK test as an int, and comparing NumericRangeQuery with DimensionalRangeQuery.

        Sources are in luceneutil, look for IndexAndSearchOpenStreetMaps1D.java.

        IntField / NumericRangeQuery

        • 176 sec to index
        • 744 MB index on disk
        • 14.4 MB in-heap (ramBytesUsed summed across all segments)
        • 7.3 sec to run 225 searches (best of 100 iters)

        DimensionalField / DimensionalRangeQuery

        • 363 sec to index
        • 472 MB index on disk
        • 2.3 MB in-heap (ramBytesUsed summed across all segments)
        • 5.4 sec to run 225 searches (best of 100 iters)

        So again a slower indexing time, but then a smaller (less so vs. the 2D case) index, much less heap used, and faster queries.

        Show
        mikemccand Michael McCandless added a comment - I tested the 1D case as well, just indexing "lat" from the London UK test as an int, and comparing NumericRangeQuery with DimensionalRangeQuery . Sources are in luceneutil, look for IndexAndSearchOpenStreetMaps1D.java . IntField / NumericRangeQuery 176 sec to index 744 MB index on disk 14.4 MB in-heap (ramBytesUsed summed across all segments) 7.3 sec to run 225 searches (best of 100 iters) DimensionalField / DimensionalRangeQuery 363 sec to index 472 MB index on disk 2.3 MB in-heap (ramBytesUsed summed across all segments) 5.4 sec to run 225 searches (best of 100 iters) So again a slower indexing time, but then a smaller (less so vs. the 2D case) index, much less heap used, and faster queries.

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development