Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-6891

Lucene60DimensionalFormat should use block prefix coding when writing values

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Today we write the whole value for every doc in one leaf block in the BKD tree, but that's crazy because the whole point of that leaf block is all the docs inside it have values that are very close together.

      So I changed this to write the common prefix for the whole block up front in each block. This requires more index-time and search-time work, but gives nice index size reductions:

      On the 2D (London, UK) lat/lon benchmark:

      • Indexing time was a wee bit slower (743 -> 747 seconds)
      • Index size was ~11% smaller (704 MB -> 630 MB)
      • Query time was ~7% slower (2.84 sec -> 3.05 sec)
      • Heap usage is the same

      On the 1D (just "lat" from the above test) benchmark:

      • Indexing time was a wee bit slower (363 -> 364 sec)
      • Index size was ~23% smaller (472 MB -> 363 MB)
      • Query time was a wee bit slower (5.39 -> 5.41 sec)
      • Heap usage is the same

      Index time can be a bit slower since there are two passes now per leaf block (first to find the common prefix per dimension, and second pass must then strip those prefixes).

      Query time is slower because there's more work per hit that needs value filtering, i.e. collating the suffixes onto the prefixes, per dimension. This affects 2D much more than 1D because 1D has fewer leaf blocks that need filtering (typically 0, 1 or 2, unless there are many duplicate values in the index).

      I suspect the index size savings is use-case dependent, e.g. if you index a bunch of ipv4 addresses along with a few ipv6 addresses, you'd probably see sizable savings.

      I also suspect the more docs you index, the greater the savings, because the cells will generally be smaller.

      Net/net I think the opto is worth it, even if it slows query time a bit.

      1. LUCENE-6891.patch
        9 kB
        Michael McCandless

        Activity

        Hide
        mikemccand Michael McCandless added a comment -

        Patch, tests/precommit passes, I think it's ready.

        Show
        mikemccand Michael McCandless added a comment - Patch, tests/precommit passes, I think it's ready.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 1714749 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1714749 ]

        LUCENE-6891: use prefix coding when writing dimensional values in each leaf block

        Show
        jira-bot ASF subversion and git services added a comment - Commit 1714749 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1714749 ] LUCENE-6891 : use prefix coding when writing dimensional values in each leaf block

          People

          • Assignee:
            mikemccand Michael McCandless
            Reporter:
            mikemccand Michael McCandless
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development