LUCENE-7589: Prevent outliers from raising the number of bits per value for everyone with numeric doc values

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Today we encode entire segments with a single number of bits per value. It was done this way because it was faster, but it also means a single outlier can significantly increase the space requirements. I think we should have protection against that.
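
      To make the problem concrete, here is a minimal, hypothetical Java sketch (not Lucene code): when a single bits-per-value is chosen for the whole segment, one outlier dictates the storage width of every value.

{code:java}
// Hypothetical illustration, not Lucene code: with a single bits-per-value
// (bpv) for the whole segment, one outlier dictates the width of every value.
import java.util.Arrays;

public class BpvOutlierDemo {

  // Bits needed to represent a non-negative long.
  static int bitsRequired(long v) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
  }

  public static void main(String[] args) {
    long[] values = new long[100_000];
    Arrays.fill(values, 40_000L); // typical values fit in 16 bits
    values[0] = 1L << 39;         // a single outlier needs 40 bits

    int bpv = 0;
    for (long v : values) {
      bpv = Math.max(bpv, bitsRequired(v));
    }
    // Prints 40: all 100,000 values are stored with 40 bits each,
    // even though 99,999 of them only need 16.
    System.out.println("segment-wide bpv = " + bpv);
  }
}
{code}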

      Attachments

      • LUCENE-7589.patch (30 kB, Adrien Grand)

        Activity

        ASF subversion and git services added a comment -

        Commit 3b182aa2fb3e4062f6ec5be819f3aa70aa2e523d in lucene-solr's branch refs/heads/feature/metrics from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3b182aa ]

        LUCENE-7589: Prevent outliers from raising the bpv for everyone.

        Adrien Grand added a comment -

        As Mike predicted, this helped the NYC taxi benchmark a bit: disk usage for the dropoff datetime field went from 194 MB to 166 MB (see http://people.apache.org/~mikemccand/lucenebench/sparseResults.html#index_size_by_field).

        ASF subversion and git services added a comment -

        Commit 3b182aa2fb3e4062f6ec5be819f3aa70aa2e523d in lucene-solr's branch refs/heads/master from Adrien Grand
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3b182aa ]

        LUCENE-7589: Prevent outliers from raising the bpv for everyone.

        Michael McCandless added a comment -

        {quote}However, if you add a new field that stores the average number of miles per hour as a long doc values field, then it highlights the quality issues of this dataset: disk usage for this field goes from 40 to 15.7 bits per value (-60%) with the patch.{quote}

        Ahhh, I see! The taxis that go faster than the speed of light are not apparent now since we don't store that field directly... makes sense.

        Adrien Grand added a comment -

        Thanks Mike for taking a look. The patch does not show much reduction, in spite of the quality issues of the dataset, since the existing fields tend not to have outliers. However, if you add a new field that stores the average number of miles per hour as a long doc values field, then it highlights the quality issues of this dataset: disk usage for this field goes from 40 to 15.7 bits per value (-60%) with the patch.

        Michael McCandless added a comment -

        The patch looks great, just this minor typo:

        {{ values for te next block.}} --> {{ values for the next block.}}

        This seems to give ~3.7% reduction in the doc values disk used for sparse taxis!

        Adrien Grand added a comment -

        Here is a patch. The doc values consumer computes space usage both for the case where all values use the same number of bits per value and for the case where values are split into blocks of 16384 values. If using blocks proves to save 10% of disk usage or more, then each block is encoded with its own required number of bits per value.

        I kept a rather high block size: this implementation can only jump forward blockSize documents at a time, so a high value like 16384 hopefully keeps performance good. In the future we might want to leverage the sequential access pattern even more (to do run-length encoding, for instance) and maybe add e.g. a skip list to handle big jumps, like postings do. I think this patch is a good first (baby) step in that direction.
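
        For illustration, here is a hedged sketch of the selection logic described above; the class and method names, the assumed one-byte per-block header, and the overall structure are assumptions rather than the actual patch code.

{code:java}
// Hedged sketch of the selection logic described above; names, the per-block
// header cost, and the structure are assumptions, not the actual patch.
public class BlockedBpvChoice {

  static final int BLOCK_SIZE = 16384;

  static int bitsRequired(long v) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
  }

  // Compare the cost of one segment-wide bpv against per-block bpv, and
  // only pick blocks when they save at least 10% of the space.
  static boolean shouldUseBlocks(long[] values) {
    int globalBpv = 0;
    for (long v : values) {
      globalBpv = Math.max(globalBpv, bitsRequired(v));
    }
    long singleCost = (long) values.length * globalBpv; // bits

    long blockCost = 0; // bits
    for (int start = 0; start < values.length; start += BLOCK_SIZE) {
      int end = Math.min(start + BLOCK_SIZE, values.length);
      int blockBpv = 0;
      for (int i = start; i < end; i++) {
        blockBpv = Math.max(blockBpv, bitsRequired(values[i]));
      }
      // assume each block stores its own bpv in a one-byte header
      blockCost += (long) (end - start) * blockBpv + Byte.SIZE;
    }
    return blockCost <= singleCost * 0.9; // the 10% threshold
  }
}
{code}

        Charging each block a small header keeps the comparison honest: per-block encoding only pays off when enough blocks have a local bpv well below the global one.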


          People

          • Assignee: Adrien Grand
          • Reporter: Adrien Grand
          • Votes: 0
          • Watchers: 5
