Lucene - Core / LUCENE-7457

Default doc values format should optimize for iterator access

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In LUCENE-7407 we switched doc values consumption from random access API to an iterator API, but nothing was done there to improve the codec. We should do that here.

      At a bare minimum we should fix the existing very-sparse case to be a true iterator, and not wrapped with the silly legacy wrappers.
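      To illustrate what a "true iterator" over sparse doc values could look like, here is a minimal sketch. It is not the code from the attached patch: the class name `SparseNumericIterator` and its in-memory arrays are hypothetical, and only the `docID()` / `nextDoc()` / `advance()` contract and the `NO_MORE_DOCS` sentinel mirror Lucene's `DocIdSetIterator`. It assumes the sparse case boils down to a sorted list of doc IDs with one value per listed doc.

```java
import java.util.Arrays;

/** Hypothetical sketch of a sparse numeric doc-values field consumed as an iterator. */
final class SparseNumericIterator {
    /** Sentinel returned once the iterator is exhausted (same value Lucene uses). */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docIds;  // assumption: sorted ascending, only docs that have a value
    private final long[] values; // values[i] belongs to docIds[i]
    private int index = -1;      // -1 means "not positioned yet"

    SparseNumericIterator(int[] docIds, long[] values) {
        if (docIds.length != values.length) {
            throw new IllegalArgumentException("docIds and values must have the same length");
        }
        this.docIds = docIds;
        this.values = values;
    }

    /** Current doc ID, -1 before the first call, NO_MORE_DOCS when exhausted. */
    int docID() {
        if (index < 0) {
            return -1;
        }
        return index < docIds.length ? docIds[index] : NO_MORE_DOCS;
    }

    /** Moves to the next doc that has a value. */
    int nextDoc() {
        if (index < docIds.length) {
            index++;
        }
        return docID();
    }

    /** Moves to the first doc >= target that lies after the current position. */
    int advance(int target) {
        int from = Math.min(index + 1, docIds.length);
        int pos = Arrays.binarySearch(docIds, from, docIds.length, target);
        index = pos >= 0 ? pos : -pos - 1; // not found: use the insertion point
        return docID();
    }

    /** Value of the current doc; only valid while positioned on a doc. */
    long longValue() {
        return values[index];
    }

    public static void main(String[] args) {
        SparseNumericIterator it =
            new SparseNumericIterator(new int[] {3, 10, 42}, new long[] {7L, 8L, 9L});
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            System.out.println("doc=" + doc + " value=" + it.longValue());
        }
    }
}
```

      With this shape, the cost of consuming the field is proportional to the number of docs that actually have a value, and no per-document random-access lookup is needed.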

      I think we should also increase the threshold (currently 1%?) at which we switch from dense to sparse encoding. This should fix LUCENE-7253, making merging of sparse doc values efficient ("pay for what you use").

      I'm sure there are many other things to explore to let codecs "take advantage" of the fact that they no longer need to offer random access to doc values.

      1. LUCENE-7457.patch
        17 kB
        Adrien Grand

          Activity

          jpountz Adrien Grand added a comment -

          Here is a patch implementing what Mike describes above as the bare minimum. I'm not sure it is worth spending too much time on this, since we will probably want to build a new DV format that better takes advantage of the iterator-style API before 7.0 is released?

          mikemccand Michael McCandless added a comment -

          Thanks Adrien Grand, this looks great! Should we also increase the sparse threshold (currently 1%) when writing doc values? Or we can wait for a follow-on issue...

          jpountz Adrien Grand added a comment -

          I don't mind increasing it to something like 10%. However I hope this will never be useful and we will write a DV format that better takes advantage of the iterator-style API before 7.0 is released?

          jpountz Adrien Grand added a comment -

          Something to be aware of when increasing it: when values require few bits (e.g. an enum or a boolean field), the doc ids can quickly start to use significant disk space, and sparse doc values could end up using more disk space than when they were densely encoded.
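
          A rough back-of-the-envelope sketch of that trade-off follows. The cost model is an assumption made for illustration only and is not the actual on-disk layout of Lucene54DocValuesFormat: dense encoding is taken to cost `bitsPerValue` plus one "has a value" bit for every document, and sparse encoding to cost `bitsPerValue` plus an uncompressed doc ID for every document that has a value. The class and method names are hypothetical.

```java
/** Hypothetical size model for dense vs. sparse doc values; not Lucene's real format. */
final class DocValuesSizeEstimate {

    /** Bits needed to write any doc ID in [0, maxDoc) without compression. */
    static int bitsPerDocId(long maxDoc) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxDoc - 1));
    }

    /** Assumed dense cost: every doc stores a value slot plus one presence bit. */
    static long denseBits(long maxDoc, int bitsPerValue) {
        return maxDoc * (bitsPerValue + 1L);
    }

    /** Assumed sparse cost: each doc with a value stores the value plus its doc ID. */
    static long sparseBits(long maxDoc, long docsWithValue, int bitsPerValue) {
        return docsWithValue * ((long) bitsPerValue + bitsPerDocId(maxDoc));
    }

    public static void main(String[] args) {
        long maxDoc = 1_000_000L;
        int[] valueWidths = {1, 32};           // boolean-like field vs. 32-bit field
        long[] docsWithValue = {10_000L, 100_000L}; // 1% and 10% density
        for (int bits : valueWidths) {
            for (long count : docsWithValue) {
                System.out.println("bitsPerValue=" + bits
                    + " docsWithValue=" + count
                    + " dense=" + denseBits(maxDoc, bits)
                    + " sparse=" + sparseBits(maxDoc, count, bits));
            }
        }
    }
}
```

          Under these assumptions, a 1-bit field over 1,000,000 docs costs 2,000,000 bits densely, 210,000 bits sparsely at 1% density, and 2,100,000 bits sparsely at 10% density, so at 10% the sparse form is already the larger one. For a 32-bit field at 10% density the sparse form is still far smaller (5,200,000 vs. 33,000,000 bits). A real format that delta-encodes doc IDs would shift these numbers, so they only show the direction of the effect.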

          mikemccand Michael McCandless added a comment -

          OK let's leave it at 1% for this issue?

          jira-bot ASF subversion and git services added a comment -

          Commit 2f88bc80c2c1afed975199adb3f340fcec8179aa in lucene-solr's branch refs/heads/master from Adrien Grand
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=2f88bc8 ]

          LUCENE-7457: Make Lucene54DocValuesFormat's sparse case actually implement an iterator.

          jpountz Adrien Grand added a comment -

          +1 I'll consider bumping it on LUCENE-7463.


            People

            • Assignee: jpountz Adrien Grand
            • Reporter: mikemccand Michael McCandless