Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This compression is used in lucene for monotonically increasing offsets, e.g. stored fields index, dv BINARY/SORTED_SET offsets, OrdinalMap (used for merging and faceting dv) and so on.

      Today this stores a +/- deviation from an expected line of y=mx + b, where b is the minValue for the block and m is the average delta from the previous value. Because it can be negative, we have to do some additional work to zigzag-decode.

      Can we just instead waste a bit for every value explicitly (lower the minValue by the min delta) so that deltas are always positive and we can have a simpler decode? Maybe If we do this, the new guy should assert that values are actually monotic at write-time. The current one supports "mostly monotic" but do we really need that flexibility anywhere? If so it could always be kept...

      1. LUCENE-5703.patch
        18 kB
        Adrien Grand

        Activity

        Hide
        Robert Muir added a comment -

        Also it would be nice to think about how much the floating point stuff saves compression-wise in practice. Maybe an integer average is enough?

        Show
        Robert Muir added a comment - Also it would be nice to think about how much the floating point stuff saves compression-wise in practice. Maybe an integer average is enough?
        Hide
        Yonik Seeley added a comment -

        This was essentially the approach I took in Heliosearch's off-heap cache, no zigzag, and an integer average.
        I don't see the point of zigzag in this specific context... if you know the range of offsets, then you can get the extra bit back by just using an offset. For example, -100 to 150 still maps to a single byte: 0-256 with an offset of -100.

        Show
        Yonik Seeley added a comment - This was essentially the approach I took in Heliosearch's off-heap cache, no zigzag, and an integer average. I don't see the point of zigzag in this specific context... if you know the range of offsets, then you can get the extra bit back by just using an offset. For example, -100 to 150 still maps to a single byte: 0-256 with an offset of -100.
        Hide
        Adrien Grand added a comment -

        I'm afraid the integer average could trigger some worst-cases space-wise. For example let's imagine that half your binary doc values have a length of 4 and the other half has a length of 5: the average length is 4.5 and we use a monotonic writer to store offsets.

        With an integer average you need to approximate the length with 4 or 5. This means that your maximum delta will be 0.5 * BLOCK_SIZE = 8192 with the current encoding, that is 14 bits per value (13 + 1 for zig-zag). With a float average there is a worst-case that typically happens if a block stores all 4s first and all 5s afterwards (max delta occurs at BLOCK_SIZE / 2 and is then 0.5 * BLOCK_SIZE / 2 = 4096) but if your 4s and 5s and more evenly distributed inside a block (which I expect to be quite likely), then you would only need a few bits per value.

        Show
        Adrien Grand added a comment - I'm afraid the integer average could trigger some worst-cases space-wise. For example let's imagine that half your binary doc values have a length of 4 and the other half has a length of 5: the average length is 4.5 and we use a monotonic writer to store offsets. With an integer average you need to approximate the length with 4 or 5. This means that your maximum delta will be 0.5 * BLOCK_SIZE = 8192 with the current encoding, that is 14 bits per value (13 + 1 for zig-zag). With a float average there is a worst-case that typically happens if a block stores all 4s first and all 5s afterwards (max delta occurs at BLOCK_SIZE / 2 and is then 0.5 * BLOCK_SIZE / 2 = 4096) but if your 4s and 5s and more evenly distributed inside a block (which I expect to be quite likely), then you would only need a few bits per value.
        Hide
        Adrien Grand added a comment -

        Here is a patch that adjusts the origin of a block (the B of A * x + B) so that all deltas are positive. This seems to improve performance by about 10% when using PackedInts.FASTEST on MonotonicAppendingLongBuffer, and it also reduces memory usage in certain cases since it might save a bit (eg. if the min delta was -1 and the max 6, we used to require bitsRequired(Math.max(6, 1)) + 1 = 4 bits while it would now require bitsRequired(6 - (-1)) = bitsRequired(7) = 3).

        Show
        Adrien Grand added a comment - Here is a patch that adjusts the origin of a block (the B of A * x + B ) so that all deltas are positive. This seems to improve performance by about 10% when using PackedInts.FASTEST on MonotonicAppendingLongBuffer , and it also reduces memory usage in certain cases since it might save a bit (eg. if the min delta was -1 and the max 6, we used to require bitsRequired(Math.max(6, 1)) + 1 = 4 bits while it would now require bitsRequired(6 - (-1)) = bitsRequired(7) = 3).
        Hide
        ASF subversion and git services added a comment -

        Commit 1600694 from Adrien Grand in branch 'dev/trunk'
        [ https://svn.apache.org/r1600694 ]

        LUCENE-5721: Monotonic compression doesn't use zig-zag encoding anymore.

        Show
        ASF subversion and git services added a comment - Commit 1600694 from Adrien Grand in branch 'dev/trunk' [ https://svn.apache.org/r1600694 ] LUCENE-5721 : Monotonic compression doesn't use zig-zag encoding anymore.
        Hide
        ASF subversion and git services added a comment -

        Commit 1600747 from Adrien Grand in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1600747 ]

        LUCENE-5721: Monotonic compression doesn't use zig-zag encoding anymore.

        Show
        ASF subversion and git services added a comment - Commit 1600747 from Adrien Grand in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1600747 ] LUCENE-5721 : Monotonic compression doesn't use zig-zag encoding anymore.

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development