Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: master (7.0), 6.2
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      For compressing values, BKDReader only relies on shared prefixes in a block. We could probably easily do better. For instance there are only 256 possible values for the first byte of the dimension that the values are sorted by, yet we use a block size of 1024. So by using something simple like run-length compression we could save 6 bits per value on average.
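To make the idea concrete, here is a toy run-length encoder for the leading byte of a sorted block, with runs capped so each length fits in one byte. This is a hypothetical sketch (class and method names are invented), not the actual BKDWriter/BKDReader code:

```java
import java.util.ArrayList;
import java.util.List;

// Toy run-length encoder for the leading byte of values in a sorted block.
// Hypothetical sketch; not the actual Lucene BKD code.
public class RunLengthSketch {

  // Encode as (byteValue, runLength) pairs. Runs are capped at 255 so the
  // length always fits in a single byte.
  static List<int[]> encode(byte[] leadingBytes) {
    List<int[]> runs = new ArrayList<>();
    int i = 0;
    while (i < leadingBytes.length) {
      int j = i;
      while (j < leadingBytes.length
          && leadingBytes[j] == leadingBytes[i]
          && j - i < 255) {
        j++;
      }
      runs.add(new int[] { leadingBytes[i] & 0xFF, j - i });
      i = j;
    }
    return runs;
  }

  public static void main(String[] args) {
    // A sorted block of 1024 values has at most 256 distinct leading bytes,
    // so runs are long and the run-length form is much smaller than 1 byte
    // per value.
    byte[] block = new byte[1024];
    for (int i = 0; i < block.length; i++) {
      block[i] = (byte) (i / 8); // 128 distinct leading bytes, runs of 8
    }
    List<int[]> runs = RunLengthSketch.encode(block);
    int encodedBytes = runs.size() * 2; // 1 byte value + 1 byte run length
    System.out.println(runs.size());    // 128 runs
    System.out.println(encodedBytes);   // 256 bytes vs 1024 raw
  }
}
```

In this dense example the leading-byte column shrinks from 1024 bytes to 256, in the spirit of the "save 6 bits per value" estimate above.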

      1. LUCENE-7371.patch
        23 kB
        Adrien Grand
      2. LUCENE-7371.patch
        22 kB
        Adrien Grand
      3. LUCENE-7371.patch
        17 kB
        Adrien Grand

        Activity

        jpountz Adrien Grand added a comment -

        Here is a patch. On the IndexAndSearchOpenStreetMaps benchmark, it saves 9.7% storage (577MB -> 521MB). The "distance" and "poly 5" benchmarks report the same response times, but the "box" benchmark is about 4% slower.

        jpountz Adrien Grand added a comment -

        I first thought the issue was with inlining, since the methods have many arguments and I had made them bigger, but it turned out that the main issue was branch misprediction due to the use of vints for encoding the run length: runs alternate almost every time between being less and greater than 127 (the boundary between 1 and 2 bytes with vints). So I capped the run length to 256 in order to be able to use one byte for run lengths all the time, and things are now faster with compression (about 75.1 QPS on master and 78.2 QPS with the patch, a 4% improvement). Disk savings are similar to the previous iteration of the patch: the index is now 522MB on disk vs. 521MB before.
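The branch in question is the data-dependent one inside a variable-length integer writer: values below 128 take one byte, larger values take more, so run lengths hovering around 127 defeat the branch predictor. A toy sketch of that shape (hypothetical code, not Lucene's DataOutput):

```java
import java.io.ByteArrayOutputStream;

// Toy vint writer illustrating the data-dependent branch that the patch
// avoids by always writing the run length as a single byte.
// Hypothetical sketch; not Lucene's DataOutput.writeVInt.
public class VIntSketch {

  static void writeVInt(ByteArrayOutputStream out, int v) {
    while ((v & ~0x7F) != 0) {         // branch taken iff v >= 128
      out.write((v & 0x7F) | 0x80);    // low 7 bits, continuation bit set
      v >>>= 7;
    }
    out.write(v);                      // final byte, continuation bit clear
  }

  public static void main(String[] args) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVInt(out, 100);               // 1 byte
    writeVInt(out, 200);               // 2 bytes
    System.out.println(out.size());    // 3
  }
}
```

With run lengths alternating across the 127 boundary, that `while` condition flips on nearly every call; a fixed one-byte length has no such branch.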

        mikemccand Michael McCandless added a comment -

        This is a nice optimization! Patch looks good!

        The BKDWriter change to pick which dimension to apply the run-length coding to is best effort, right? Because you could have a dim with fewer unique leading suffix bytes but a larger delta between first and last values? But it would take quite a bit more work at indexing time to figure that out ... maybe add a comment explaining this tradeoff? It seems likely the "min delta" approach should work well in practice, but have you tried the slow-but-correct approach to verify?

        Also, I noticed TestBackwardsCompatibility seems not to test points! I'll go fix that ...
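For context, the "min delta" heuristic being discussed might look like the following sketch: pick the dimension whose leading suffix byte spans the smallest range between the block's first and last values, since in a sorted block that range upper-bounds the number of runs. This is a hypothetical illustration (names invented), not the actual patch:

```java
// Hypothetical sketch of the "min delta" heuristic: choose the dimension
// whose leading byte changes the least between the block's first and last
// values, as a cheap upper bound on the number of runs.
// Not the actual BKDWriter code.
public class PickDimSketch {

  // firstValue[dim] / lastValue[dim] hold the per-dimension bytes of the
  // block's first and last values (after shared-prefix removal).
  static int pickRunLenDim(byte[][] firstValue, byte[][] lastValue) {
    int best = 0;
    int bestDelta = Integer.MAX_VALUE;
    for (int dim = 0; dim < firstValue.length; dim++) {
      int delta = (lastValue[dim][0] & 0xFF) - (firstValue[dim][0] & 0xFF);
      if (delta < bestDelta) {
        bestDelta = delta;
        best = dim;
      }
    }
    return best;
  }
}
```

As Mike notes, this is best effort: a dimension could have a small delta yet more distinct leading bytes in between than another dimension, and counting the actual runs per dimension would cost more at indexing time.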

        jpountz Adrien Grand added a comment -

        Good point, here is an updated patch. I am getting the following indexing times, which suggests that it does not hurt:

        278.064423505 sec to index part 0 // master
        270.492947789 sec to index part 0 // patch
        

        The index size is also unchanged, which was expected since the previous heuristic should be equivalent with dense data.

        Also, I noticed TestBackwardsCompatibility seems not to test points! I'll go fix that ...

        Ouch, good catch. Thanks!

        mikemccand Michael McCandless added a comment -

        +1, great!

        jira-bot ASF subversion and git services added a comment -

        Commit 866398bea67607bcd54331a48736e6bdb94a703d in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=866398b ]

        LUCENE-7371: Better compression of values in Lucene60PointsFormat.

        jira-bot ASF subversion and git services added a comment -

        Commit 1f446872aa9346c22643d0fb753ec42942b5a4d2 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1f44687 ]

        LUCENE-7371: Better compression of values in Lucene60PointsFormat.

        jira-bot ASF subversion and git services added a comment -

        Commit 1a6df249f91ca9f4dab792c48f5965f3388f1776 in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=1a6df24 ]

        LUCENE-7371: Fix CHANGES entry.

        jira-bot ASF subversion and git services added a comment -

        Commit b54d46722b36f107edd59a8d843b93f5727a9058 in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b54d467 ]

        LUCENE-7371: Fix CHANGES entry.

        jpountz Adrien Grand added a comment -

        Thanks Mike!

        jpountz Adrien Grand added a comment -

        The nightly benchmarks are reporting interesting changes. Some queries seem to perform slightly faster now, like IntNRQ (http://people.apache.org/~mikemccand/lucenebench/IntNRQ.html) or the geo3d distance filter (http://people.apache.org/~mikemccand/geobench.html#search-distance), but others seem a bit slower, like the 10-gon filter (http://people.apache.org/~mikemccand/geobench.html#search-poly_10) or the 10 nearest points (http://people.apache.org/~mikemccand/geobench.html#search-nearest_10). I think the fact that it is not consistently slower or faster is due to the distribution of points in the blocks that need to be read (the more unique leading bytes, the more expensive the read). Given that the slowdown is not common to all benchmarks and that the size reduction is significant, I don't think this should be reverted, but let me know if you think otherwise. (For the record, many benchmarks look slower on July 17th, but I don't think that is related to this change; for instance, even phrases got slower: http://people.apache.org/~mikemccand/lucenebench/Phrase.html)

        rcmuir Robert Muir added a comment -

        I think Michael McCandless may have upgraded his operating system.

        mikemccand Michael McCandless added a comment -

        Oh sorry, I upgraded the Linux kernel from 4.4 -> 4.6.4 on 7/17! I'll add an annotation.

        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.


          People

          • Assignee:
            jpountz Adrien Grand
            Reporter:
            jpountz Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue
