  Lucene - Core / LUCENE-7563

BKD index should compress unused leading bytes

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.4
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Today the BKD (points) in-heap index always stores dimensionNumBytes bytes per dimension, but if, e.g., you are indexing LongPoint and only the bottom two bytes are actually used in a given segment, we shouldn't store all those leading 0s in the index.
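
      As a rough illustration (a hypothetical helper, not BKDWriter's actual code), the per-segment savings come from detecting how many leading bytes every packed value shares and storing only the differing suffix:

          // Hypothetical helper (not Lucene's BKDWriter): how many leading bytes
          // do all packed values in this segment share?  Only the remaining
          // suffix bytes need to be stored in the in-heap index.
          static int commonPrefixLength(byte[][] packedValues, int bytesPerDim) {
            int prefix = bytesPerDim;
            byte[] first = packedValues[0];
            for (byte[] value : packedValues) {
              int i = 0;
              while (i < prefix && value[i] == first[i]) {
                i++;
              }
              prefix = i;
            }
            return prefix; // e.g. 6 for longs that only ever use their bottom two bytes
          }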

      Attachments

      1. LUCENE-7563.patch
        193 kB
        Michael McCandless
      2. LUCENE-7563.patch
        83 kB
        Michael McCandless
      3. LUCENE-7563.patch
        76 kB
        Michael McCandless
      4. LUCENE-7563.patch
        76 kB
        Michael McCandless
      5. LUCENE-7563-prefixlen-unary.patch
        20 kB
        Adrien Grand
      6. LUCENE-7563-prefixlen-unary.patch
        9 kB
        Adrien Grand

        Activity

        jpountz Adrien Grand added a comment -

        +1

        mikemccand Michael McCandless added a comment -

        Initial patch, compacting (at indexing time) the long[] leafBlockFPs
        and byte[] splitPackedValues in the current BKD index into a smaller
        packed byte[] structure.

        I still have a number of nocommits but the core idea seems to be
        working ... Lucene's tests pass at least once.

        At first I prototyped using an FST to compact the binary tree index,
        and it reduced quite a bit, but then I realized even FST has some
        annoying inefficiency for this usage, so I made this dedicated packed
        byte[] structure instead which compressed even better.

        On our nightly 20M NYC taxis benchmark
        (https://home.apache.org/~mikemccand/lucenebench/sparseResults.html)
        this reduces heap usage by ~60%.

        It compresses the N>1 dimensions case too, but I haven't tested by how much...

        The structure should work very well for apps that e.g. index a short
        as if it were a long, prefix-coding away all those leading 0s. It
        takes advantage of how the BKD tree is traversed at search time,
        always starting at root and pushing down to children.
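
        A minimal sketch of why that traversal order helps (hypothetical code, not the patch's BKDReader): each inner node stores only the suffix bytes of its split value, and the full value is rebuilt in place while pushing down from the root:

            import java.io.IOException;
            import org.apache.lucene.store.DataInput;

            // Hypothetical sketch, not the patch's BKDReader: because search always
            // descends from the root, an inner node only needs the suffix bytes of
            // its split value that differ from the value rebuilt along the path.
            class PrefixCodedDescent {
              // `reconstructed` arrives holding the parent path's value and leaves
              // holding this node's full split value.
              static void readSplitValue(DataInput in, byte[] reconstructed) throws IOException {
                int prefix = in.readVInt();                  // leading bytes shared with the path
                int suffix = reconstructed.length - prefix;
                in.readBytes(reconstructed, prefix, suffix); // overwrite only the differing tail
                // a real reader would save the overwritten bytes and restore them
                // after recursing into the left and right children
              }
            }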

        Separately, I think we could consider increasing the max leaf block size
        from 1024 points to maybe 2048 or 4096 points ... I'll open a new
        issue for that.

        mikemccand Michael McCandless added a comment -

        Another iteration on the patch; I think it's ready.

        I tested on the 20M sparse taxis data set and this change gives a
        sizable (~56% - ~59%) reduction in heap usage:

        • sparse-sorted: 6.14 MB -> 2.49 MB
        • sparse: 4.93 MB -> 2.17 MB
        • dense: 4.88 MB -> 2.09 MB
        jpountz Adrien Grand added a comment -

        It seems we are always delta coding with the split value of the parent level, but for the multi-dimensional case, I think it would be better to delta-code with the last split value that was on the same dimension? Otherwise compression would be very poor if both dimensions store a very different range of values?
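
        A sketch of the per-dimension delta coding being suggested (hypothetical writer code with assumed numDims/bytesPerDim fields, not the patch itself): the shared prefix is computed against the previous split value on the same dimension, so dimensions with very different value ranges don't hurt each other:

            import java.io.IOException;
            import org.apache.lucene.store.DataOutput;

            // Hypothetical writer, not the patch itself: the shared prefix of each
            // split value is computed against the previous split value seen on the
            // same dimension, and only the remaining suffix is written.
            class PerDimSplitValueWriter {
              final int numDims, bytesPerDim;
              final byte[][] lastSplitValue; // last split value written, per dimension

              PerDimSplitValueWriter(int numDims, int bytesPerDim) {
                this.numDims = numDims;
                this.bytesPerDim = bytesPerDim;
                this.lastSplitValue = new byte[numDims][bytesPerDim];
              }

              void writeSplitValue(int splitDim, byte[] splitValue, DataOutput out) throws IOException {
                byte[] last = lastSplitValue[splitDim];
                int prefix = 0;
                while (prefix < bytesPerDim && splitValue[prefix] == last[prefix]) {
                  prefix++; // bytes shared with the last split on this dimension
                }
                out.writeByte((byte) splitDim);
                out.writeByte((byte) prefix);
                out.writeBytes(splitValue, prefix, bytesPerDim - prefix); // suffix only
                System.arraycopy(splitValue, 0, last, 0, bytesPerDim);
              }
            }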

        Something else I was wondering is whether we can make bigger gains. For instance we use whole bytes to store the split dimension or the prefix length while they only need 3 and 4 bits? In the multi-dimensional case we could store both on a single byte. Maybe we can do even better, I haven't thought much about it.

        It doesn't need to be done in the same patch, but it would also be nice for SimpleText to not use the legacy format of the index. I'm not sure how to proceed however.

        mikemccand Michael McCandless added a comment -

        It seems we are always delta coding with the split value of the parent level, but for the multi-dimensional case, I think it would be better to delta-code with the last split value that was on the same dimension?

        Hmm I think I am already doing that? Note that the
        splitValuesStack in BKDReader.PackedIndexTree holds all
        dimensions' last split values, and then when I read the suffix bytes
        in, I copy them into the packed values for the current split
        dimension:

                in.readBytes(splitValuesStack[level], splitDim*bytesPerDim+prefix, suffix);
        

        I think?

        I'll test on the OpenStreetMaps geo benchmark to measure the impact
        ... I'll also run the 2B tests to make sure nothing broke.

        For instance we use whole bytes to store the split dimension or the prefix length while they only need 3 and 4 bits? In the multi-dimensional case we could store both on a single byte.

        Oooh that's a great idea! Saves 1 byte per inner node. We need 5
        bits for the prefix I think since it can range 0 .. 16 inclusive, and
        3 bits for the splitDim since it's 0 .. 7 inclusive.
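
        For instance, a minimal sketch of that single-byte packing (illustrative only; the actual patch may lay the bits out differently):

            // prefix is 0 .. 16 inclusive (5 bits) and splitDim is 0 .. 7 (3 bits),
            // so together they fit exactly in one byte
            static byte pack(int prefix, int splitDim) {
              return (byte) ((prefix << 3) | splitDim);
            }

            static int unpackSplitDim(byte packed) {
              return packed & 0x07;         // low 3 bits
            }

            static int unpackPrefix(byte packed) {
              return (packed & 0xFF) >>> 3; // high 5 bits
            }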

        It doesn't need to be done in the same patch, but it would also be nice for SimpleText to not use the legacy format of the index. I'm not sure how to proceed however.

        Yeah I'm not sure what to do here either ... but it felt wrong to just
        pass these packed bytes to the simple text format ... that packed form
        is even further from "simple" than the two arrays we have now.

        jpountz Adrien Grand added a comment -

        Hmm I think I am already doing that?

        You are right, I had not read the code correctly.

        Oooh that's a great idea! Saves 1 byte per inner node. We need 5 bits for the prefix I think since it can range 0 .. 16 inclusive, and 3 bits for the splitDim since it's 0 .. 7 inclusive.

        I have been thinking about it more and I think we can make it more general. The first two bytes that differ are likely close to each other, so if we call their difference firstByteDelta, we could pack firstByteDelta, splitDim and prefix into a single vint (e.g. (firstByteDelta * (1 + bytesPerDim) + prefix) * numDims + splitDim) that would sometimes only take one byte (quite often when numDims and bytesPerDim are small and rarely in the opposite case).
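
        A sketch of that combined encoding (illustrative only), using Lucene's DataOutput/DataInput vInt methods; decoding just peels the mixed-radix factors back off in reverse order:

            import java.io.IOException;
            import org.apache.lucene.store.DataInput;
            import org.apache.lucene.store.DataOutput;

            // Illustrative only: fold firstByteDelta, prefix and splitDim into one
            // mixed-radix code; writeVInt emits a single byte whenever code < 128.
            class CombinedCode {
              static void write(DataOutput out, int firstByteDelta, int prefix, int splitDim,
                                int bytesPerDim, int numDims) throws IOException {
                int code = (firstByteDelta * (1 + bytesPerDim) + prefix) * numDims + splitDim;
                out.writeVInt(code);
              }

              static int[] read(DataInput in, int bytesPerDim, int numDims) throws IOException {
                int code = in.readVInt();
                int splitDim = code % numDims;
                code /= numDims;
                int prefix = code % (1 + bytesPerDim);
                int firstByteDelta = code / (1 + bytesPerDim);
                return new int[] {firstByteDelta, prefix, splitDim};
              }
            }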

        but it felt wrong to just pass these packed bytes to the simple text format ...

        Agreed. Maybe we should duplicate the current BKDReader/BKDWriter into a new impl that would be specific to SimpleText and would not need all those optimizations so that both impls can evolve separately.

        mikemccand Michael McCandless added a comment -

        New patch, folding in Adrien Grand's first idea. I like the second idea
        ... I'll try that next.

        I tested on LatLonPoint and Geo3D with the ~60M document
        OpenStreetMaps geo benchmark; it reduces heap usage from 2.29 MB
        to 1.79 MB (Geo3D) and from 2.29 MB to 1.77 MB (LatLonPoint), ~22% smaller.

        mikemccand Michael McCandless added a comment -

        New patch; I think it's ready.

        This breaks out a private BKD implementation for SimpleText which
        is a nice cleanup for the core BKD implementation, e.g. BKDReader
        is now final; its strange protected constructor is gone; protected
        methods are now private.

        This patch also implements Adrien Grand's last compression idea, to often
        use only 1 byte to encode prefix, splitDim and first-byte-delta of the
        suffix instead of the 2 bytes required in the previous iterations.
        This gives another ~4-5% compression improvement:

        • sparse-sorted -> 2.37 MB
        • sparse -> 2.07 MB
        • dense -> 2.00 MB

        And the OpenStreetMaps geo benchmark:

        • geo3d -> 1.75 MB
        • LatLonPoint -> 1.72 MB

        I'm running the 2B BKD and Points tests now ... if those pass, I plan
        to push to master first and let this bake a bit before backporting.

        jira-bot ASF subversion and git services added a comment -

        Commit 5e8db2e068f2549b9619d5ac48a50c8032fc292b in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5e8db2e ]

        LUCENE-7563: use a compressed format for the in-heap BKD index

        jpountz Adrien Grand added a comment -

        The change looks good and the drop is quite spectacular. http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#searcher_heap I think there is just a redundant arraycopy in clone()?

        For the record, I played with another idea leveraging the fact that the prefix lengths on two consecutive levels are likely close to each other, and the most common values for the deltas are 0, then 1, then -1. So we might be able to do more savings by encoding the delta between consecutive prefix lengths using unary coding on top of zig-zag encoding, which would allow encoding 0 in 1 bit, 1 in 2 bits, 2 in 3 bits, etc. However it only saved 1% memory on IndexOSM and less than 1% on IndexTaxis. I'm attaching it here if someone wants to have a look but I don't think the gains are worth the complexity.
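
        For reference, a small sketch of that zig-zag + unary idea (illustrative only, using the standard zig-zag mapping; the attached LUCENE-7563-prefixlen-unary.patch may differ in details), writing each zig-zag-encoded delta v as v zero bits followed by a terminating one bit:

            import java.util.BitSet;

            // Illustrative only; the attached patch may encode things differently.
            class PrefixLenUnary {
              // zig-zag: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ... keeps small |delta| small
              static int zigZag(int delta) {
                return (delta >> 31) ^ (delta << 1);
              }

              // unary code of v: v zero bits then a terminating one bit (v + 1 bits total)
              static int writeUnary(int v, BitSet bits, int pos) {
                bits.set(pos + v);   // the bits before it are left as 0
                return pos + v + 1;  // next free bit position
              }

              // encode the deltas of consecutive prefix lengths; a delta of 0 costs one bit
              static int encode(int[] prefixLens, BitSet bits) {
                int pos = 0, prev = 0;
                for (int len : prefixLens) {
                  pos = writeUnary(zigZag(len - prev), bits, pos);
                  prev = len;
                }
                return pos; // total number of bits written
              }
            }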

        jira-bot ASF subversion and git services added a comment -

        Commit bd8b191505d92c89a483a6189497374238476a00 in lucene-solr's branch refs/heads/master from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bd8b191 ]

        LUCENE-7563: remove redundant array copy in PackedIndexTree.clone

        mikemccand Michael McCandless added a comment -

        I think there is just a redundant arraycopy in clone()?

        Thanks, I pushed a fix!

        For the record, I played with another idea leveraging the fact that the prefix lengths on two consecutive levels are likely close to each other,

        I like this idea! But I hit this test failure ... doesn't reproduce on trunk:

           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestBKD -Dtests.method=testWastedLeadingBytes -Dtests.seed=2E5F0E183BBA1098 -Dtests.locale=es-PR -Dtests.timezone=CST -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
           [junit4] ERROR   0.90s J1 | TestBKD.testWastedLeadingBytes <<<
           [junit4]    > Throwable #1: java.lang.ArrayIndexOutOfBoundsException: -32
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([2E5F0E183BBA1098:ABD9D50B47794EFC]:0)
           [junit4]    > 	at org.apache.lucene.util.bkd.BKDReader$PackedIndexTree.readNodeData(BKDReader.java:442)
           [junit4]    > 	at org.apache.lucene.util.bkd.BKDReader$PackedIndexTree.<init>(BKDReader.java:343)
           [junit4]    > 	at org.apache.lucene.util.bkd.BKDReader.getIntersectState(BKDReader.java:526)
           [junit4]    > 	at org.apache.lucene.util.bkd.BKDReader.intersect(BKDReader.java:498)
           [junit4]    > 	at org.apache.lucene.util.bkd.TestBKD.testWastedLeadingBytes(TestBKD.java:1042)
           [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
        
        jpountz Adrien Grand added a comment -

        I dug into it; the test failure may happen with large numbers of bytes per dimension. It could be fixed if we limited the number of bytes per value in BKDWriter to 16 (like we do in FieldInfos) and made code a long.

        mikemccand Michael McCandless added a comment -

        Ahh, OK; I think we should restrict TestBKD to the same dimension count / bytes-per-dimension limits that Lucene enforces? As we tighten up how we compress the on-disk format and the in-heap index, we should only test what we actually offer to the end user.

        jpountz Adrien Grand added a comment -

        Here is an updated patch that makes BKDWriter ensure each dimension has 16 bytes at most, plus some minor tweaks to get a few bits back on average.

        jira-bot ASF subversion and git services added a comment -

        Commit 5e8db2e068f2549b9619d5ac48a50c8032fc292b in lucene-solr's branch refs/heads/apiv2 from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5e8db2e ]

        LUCENE-7563: use a compressed format for the in-heap BKD index

        jira-bot ASF subversion and git services added a comment -

        Commit bd8b191505d92c89a483a6189497374238476a00 in lucene-solr's branch refs/heads/apiv2 from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bd8b191 ]

        LUCENE-7563: remove redundant array copy in PackedIndexTree.clone

        jira-bot ASF subversion and git services added a comment -

        Commit f51766c00fc374a6fc6f407b723bd8458556de7d in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f51766c ]

        LUCENE-7563: use a compressed format for the in-heap BKD index

        jira-bot ASF subversion and git services added a comment -

        Commit fd1f608b49a7a8b5f7e6cc805378da2217ec657a in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=fd1f608 ]

        LUCENE-7563: remove redundant array copy in PackedIndexTree.clone

        jira-bot ASF subversion and git services added a comment -

        Commit 0c8e8e396a4ccc41e6af78ac7d0342716c36902a in lucene-solr's branch refs/heads/branch_6x from Mike McCandless
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0c8e8e3 ]

        LUCENE-7563: fix 6.x backport compilation errors


          People

          • Assignee: Unassigned
          • Reporter: mikemccand Michael McCandless
          • Votes: 0
          • Watchers: 5

          Dates

          • Created:
          • Updated:
          • Resolved:
