Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Dimensional values give a smaller index, and faster search times, for indexing ordered byte[] values across one or more dimensions, vs our existing approaches, but the indexing time is substantially slower.

      Since the 1D case is so important/common (numeric fields, range query) I think it's worth optimizing its indexing time. It should also be possible to optimize the N > 1 dimensions case too, but it's more complex ... we can postpone that.

      So for the 1D case, I changed the merge method to do a merge sort (like postings) of the already sorted segments dimensional values, instead of simply re-indexing all values from the incoming segments, and this was a big speedup.

      I also changed from InPlaceMergeSorter to IntroSorter (this is what postings use, and it's faster but still safe) and this was another good speedup, which should also help the > 1D cases.

      Finally, I added a BKDReader.verify method (currently it's dark: NOT called) that walks the index and then check that every value in each leaf block does in fact fall within what the index expected/claimed. This is useful for finding bugs! Maybe we can cleanly fold it into CheckIndex somehow later.

      1. LUCENE-6901-timsort.patch
        58 kB
        Michael McCandless
      2. LUCENE-6901.patch
        55 kB
        Michael McCandless
      3. LUCENE-6901.patch
        56 kB
        Michael McCandless

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Patch, I think it's close, but there are a still a few nocommits.

        I'll run performance tests ...

        Show
        mikemccand Michael McCandless added a comment - Patch, I think it's close, but there are a still a few nocommits. I'll run performance tests ...
        Hide
        mikemccand Michael McCandless added a comment -

        OK I tested the 1D case: this patch reduces indexing time from 370.2 sec (trunk) to 69.0 sec (this patch) for the 32 bit case, just indexing and searching on latitude (quantized to an int) from the 2D London, UK benchmark.

        This is quite a bit faster than NumericField which takes 175.6 sec to build the index.

        Index size is the same, heap used at search time is a bit smaller (2.3 MB -> 2.1 MB) just because the merge implementation packs each leaf block with the maximum allowed count vs indexing which is between 50% and 100% of the maximum, and search speed is the same.

        I'll test 2D next ... the IntroSorter change should have sped that up somewhat ... I'm going to try TimSorter next

        Show
        mikemccand Michael McCandless added a comment - OK I tested the 1D case: this patch reduces indexing time from 370.2 sec (trunk) to 69.0 sec (this patch) for the 32 bit case, just indexing and searching on latitude (quantized to an int) from the 2D London, UK benchmark. This is quite a bit faster than NumericField which takes 175.6 sec to build the index. Index size is the same, heap used at search time is a bit smaller (2.3 MB -> 2.1 MB) just because the merge implementation packs each leaf block with the maximum allowed count vs indexing which is between 50% and 100% of the maximum, and search speed is the same. I'll test 2D next ... the IntroSorter change should have sped that up somewhat ... I'm going to try TimSorter next
        Hide
        mikemccand Michael McCandless added a comment -

        OK for the 2D case this patch brings indexing time from 737.1 sec (trunk) to 441.5 sec (this patch), which is nice

        Note that the test is entirely single threaded: one indexing thread, SerialMergeScheduler.

        Trying TimSorter next ...

        Show
        mikemccand Michael McCandless added a comment - OK for the 2D case this patch brings indexing time from 737.1 sec (trunk) to 441.5 sec (this patch), which is nice Note that the test is entirely single threaded: one indexing thread, SerialMergeScheduler. Trying TimSorter next ...
        Hide
        dsmiley David Smiley added a comment -

        Nice indeed

        I would appreciate hearing your thoughts on why you choose specific sort algorithms over others.

        Show
        dsmiley David Smiley added a comment - Nice indeed I would appreciate hearing your thoughts on why you choose specific sort algorithms over others.
        Hide
        mikemccand Michael McCandless added a comment -

        Here's the TimSorter based patch, but it's a bit slower indexing time: 1D is 82.8 sec with TimSorter but 69.0 sec with the first patch (using IntroSorter), while 2D is 467.1 sec with TimSorter and 441.5 sec with IntroSorter.

        At first the 2D case surprised me because we are sorting already sorted inputs during merge ... need to think more about why TimSorter didn't help here.

        Anyway I'll clean up the IntroSorter patch ...

        Show
        mikemccand Michael McCandless added a comment - Here's the TimSorter based patch, but it's a bit slower indexing time: 1D is 82.8 sec with TimSorter but 69.0 sec with the first patch (using IntroSorter ), while 2D is 467.1 sec with TimSorter and 441.5 sec with IntroSorter . At first the 2D case surprised me because we are sorting already sorted inputs during merge ... need to think more about why TimSorter didn't help here. Anyway I'll clean up the IntroSorter patch ...
        Hide
        mikemccand Michael McCandless added a comment -

        New patch, fixing nocommits and a couple test failures after beasting ... I think it's ready.

        Show
        mikemccand Michael McCandless added a comment - New patch, fixing nocommits and a couple test failures after beasting ... I think it's ready.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 1716189 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1716189 ]

        LUCENE-6901: speed up dimensional values indexing and merging

        Show
        jira-bot ASF subversion and git services added a comment - Commit 1716189 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1716189 ] LUCENE-6901 : speed up dimensional values indexing and merging

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development