Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7351

BKDWriter should compress doc ids when all values in a block are the same

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.2
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      BKDWriter writes doc ids using 4 bytes per document. I think it should compress similarly to postings when all docs in a block have the same packed value. This can happen either when a field has a default value which is common across documents or when quantization makes the number of unique values so small that a large index will necessarily have blocks that all contain the same value (eg. there are only 63490 unique half-float values).

      1. LUCENE-7351.patch
        14 kB
        Adrien Grand
      2. LUCENE-7351.patch
        17 kB
        Adrien Grand
      3. LUCENE-7351.patch
        19 kB
        Adrien Grand

        Activity

        Hide
        jpountz Adrien Grand added a comment -

        I have been experimenting with the attached patch, which compresses doc ids based on the number of required bytes to store them (it only specializes 8, 16, 24 and 32 bits per doc id) and also adds delta-compression for blocks whose values are all the same. The IndexAndSearchOpenStreetMaps reported a slow down of 1.7% for the box benchmark (72.3 QPS -> 71.1 QPS) but storage requirements decreased by 9.1% (635MB -> 577MB). The storage requirements improve even more with types that require fewer bytes (LatLonPoint requires 8 bytes per value). For instance indexing 10M random half floats with the patch requires 28MB vs 43MB on master (-35%).

        Show
        jpountz Adrien Grand added a comment - I have been experimenting with the attached patch, which compresses doc ids based on the number of required bytes to store them (it only specializes 8, 16, 24 and 32 bits per doc id) and also adds delta-compression for blocks whose values are all the same. The IndexAndSearchOpenStreetMaps reported a slow down of 1.7% for the box benchmark (72.3 QPS -> 71.1 QPS) but storage requirements decreased by 9.1% (635MB -> 577MB). The storage requirements improve even more with types that require fewer bytes (LatLonPoint requires 8 bytes per value). For instance indexing 10M random half floats with the patch requires 28MB vs 43MB on master (-35%).
        Hide
        jpountz Adrien Grand added a comment -

        Updated patch. It now specializes both reading doc ids into an array and feeding a visitor, which seems to help get the performance back to what it is on master, or at least less than 1% slower (not easy to distinguish minor slowdowns to noise at this stage).

        It has 3 cases:

        • increasing doc ids, which is expected to happen for either sorted segments or when all docs in a block have the same value. In that case, we delta-encode using vints.
        • doc ids requiring less than 24 bits, which are encoded on 3 bytes.
        • doc ids requiring less than 32 bits, which are encoded on 4 bytes like on master today.

        I think it's ready to go?

        Show
        jpountz Adrien Grand added a comment - Updated patch. It now specializes both reading doc ids into an array and feeding a visitor, which seems to help get the performance back to what it is on master, or at least less than 1% slower (not easy to distinguish minor slowdowns to noise at this stage). It has 3 cases: increasing doc ids, which is expected to happen for either sorted segments or when all docs in a block have the same value. In that case, we delta-encode using vints. doc ids requiring less than 24 bits, which are encoded on 3 bytes. doc ids requiring less than 32 bits, which are encoded on 4 bytes like on master today. I think it's ready to go?
        Hide
        rcmuir Robert Muir added a comment -

        I like this better than the last patch, I think the optimization is more general.

        I think in the base test class, tesMostEqual() is a mistake?

        Show
        rcmuir Robert Muir added a comment - I like this better than the last patch, I think the optimization is more general. I think in the base test class, tesMostEqual() is a mistake?
        Hide
        jpountz Adrien Grand added a comment -

        Hmm I can remove both actually, they do not bring value now that the detection of whether doc ids are sorted is based on the doc ids themselves rather than the fact that there is a single value in a block.

        Show
        jpountz Adrien Grand added a comment - Hmm I can remove both actually, they do not bring value now that the detection of whether doc ids are sorted is based on the doc ids themselves rather than the fact that there is a single value in a block.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 01de73bc0a1b315bbbe4df046b5c0661cec8de2e in lucene-solr's branch refs/heads/branch_6x from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=01de73b ]

        LUCENE-7351: Doc id compression for points.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 01de73bc0a1b315bbbe4df046b5c0661cec8de2e in lucene-solr's branch refs/heads/branch_6x from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=01de73b ] LUCENE-7351 : Doc id compression for points.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit d66e9935c39ed859659de46d3d5cfb66f2279bd4 in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d66e993 ]

        LUCENE-7351: Doc id compression for points.

        Show
        jira-bot ASF subversion and git services added a comment - Commit d66e9935c39ed859659de46d3d5cfb66f2279bd4 in lucene-solr's branch refs/heads/master from Adrien Grand [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d66e993 ] LUCENE-7351 : Doc id compression for points.
        Hide
        mikemccand Michael McCandless added a comment -

        Bulk close resolved issues after 6.2.0 release.

        Show
        mikemccand Michael McCandless added a comment - Bulk close resolved issues after 6.2.0 release.

          People

          • Assignee:
            Unassigned
            Reporter:
            jpountz Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development