  1. Lucene - Core
  2. LUCENE-8836

Optimize DocValues TermsDict to continue scanning from the last position when possible

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Lucene Fields:
      New, Patch Available

      Description

      Lucene80DocValuesProducer.TermsDict is used to look up either a term or a term ordinal.

      Currently it lacks an optimization that FSTEnum has: the ability to continue a sequential scan from where the last lookup stopped in the IndexInput. For sparse lookups (when searching for only a few terms or ordinals) this is not an issue. But for multiple lookups in a row, this optimization could avoid re-scanning all the terms from the block start, which is otherwise required because the terms are delta-encoded.
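
To illustrate the idea, here is a minimal sketch in plain Java. It is not the patch and not Lucene's on-disk format: the class `ResumableTermsDictSketch`, its `BLOCK_SIZE`, the `lookupOrd` method, and the two counters are all hypothetical names made up for this example. Terms are prefix-compressed against the previous term within a block, so decoding an ordinal normally means re-reading every term from the block start. The sketch resumes from the last decoded term when the target lies at or after it in the same block (a simplification; resuming across block boundaries is left out).

```java
import java.util.List;

/** Toy prefix-compressed terms dictionary (hypothetical, not Lucene's format). */
final class ResumableTermsDictSketch {
  private static final int BLOCK_SIZE = 4; // assumed number of terms per block

  private final int[] prefixLen;  // chars shared with the previous term (0 at a block start)
  private final String[] suffix;  // remaining chars of each term

  private int currentOrd = -1;                                  // ordinal of the last decoded term
  private final StringBuilder currentTerm = new StringBuilder(); // last decoded term

  long blockSeeks = 0; // how often we jumped back to a block start
  long termReads = 0;  // how many terms we decoded

  ResumableTermsDictSketch(List<String> sortedTerms) {
    int n = sortedTerms.size();
    prefixLen = new int[n];
    suffix = new String[n];
    for (int i = 0; i < n; i++) {
      String term = sortedTerms.get(i);
      int shared = 0;
      if (i % BLOCK_SIZE != 0) { // the first term of a block is stored in full
        String prev = sortedTerms.get(i - 1);
        int max = Math.min(prev.length(), term.length());
        while (shared < max && prev.charAt(shared) == term.charAt(shared)) {
          shared++;
        }
      }
      prefixLen[i] = shared;
      suffix[i] = term.substring(shared);
    }
  }

  /** Decodes the term at targetOrd; when resume is true, continues from the last position if possible. */
  String lookupOrd(int targetOrd, boolean resume) {
    if (targetOrd < 0 || targetOrd >= suffix.length) {
      throw new IndexOutOfBoundsException("ord " + targetOrd);
    }
    int blockStart = targetOrd - targetOrd % BLOCK_SIZE;
    boolean canResume = resume && currentOrd >= blockStart && currentOrd <= targetOrd;
    if (!canResume) {
      blockSeeks++;                // go back to the start of the target block
      currentOrd = blockStart - 1;
      currentTerm.setLength(0);
    }
    while (currentOrd < targetOrd) { // decode forward, one delta at a time
      currentOrd++;
      termReads++;
      currentTerm.setLength(prefixLen[currentOrd]);
      currentTerm.append(suffix[currentOrd]);
    }
    return currentTerm.toString();
  }

  public static void main(String[] args) {
    List<String> terms =
        List.of("apple", "apply", "apricot", "banana", "band", "bandit", "can", "candle");
    ResumableTermsDictSketch naive = new ResumableTermsDictSketch(terms);
    ResumableTermsDictSketch resumed = new ResumableTermsDictSketch(terms);
    for (int ord = 0; ord < terms.size(); ord++) {
      naive.lookupOrd(ord, false);
      resumed.lookupOrd(ord, true);
    }
    System.out.println("naive:   seeks=" + naive.blockSeeks + " reads=" + naive.termReads);
    System.out.println("resumed: seeks=" + resumed.blockSeeks + " reads=" + resumed.termReads);
  }
}
```

On this toy input, looking up ordinals 0 to 7 in order costs 8 seeks and 20 term reads without resuming, and 2 seeks and 8 term reads with resuming. A sparse or backward lookup simply falls back to the block-start path, so it costs the same as before.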

      This patch proposes the optimization.

      To estimate the gain, we ran 3 Lucene tests while counting the seeks and the term reads in the IndexInput, with and without the optimization:

      TestLucene70DocValuesFormat - the optimization saves 24% of seeks and 15% of term reads.
      TestDocValuesQueries - the optimization adds 0.7% more seeks and 0.003% more term reads.
      TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% of seeks and 82% of term reads.
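
For readers who want to reproduce this kind of comparison, the sketch below shows one way a relative saving could be computed from two raw counters. It assumes the percentages are relative to the count without the optimization; the issue does not state the exact formula, and the class `SeekStats` and method `savedPercent` are hypothetical names. The inputs in `main` are placeholder counts, not measurements from the tests above.

```java
/** Hypothetical helper for comparing operation counts with and without an optimization. */
final class SeekStats {
  private SeekStats() {}

  /**
   * Relative saving in percent, measured against the baseline count.
   * A negative result means the optimized run performed more operations.
   */
  static double savedPercent(long baselineCount, long optimizedCount) {
    if (baselineCount <= 0) {
      throw new IllegalArgumentException("baselineCount must be positive");
    }
    return 100.0 * (baselineCount - optimizedCount) / baselineCount;
  }

  public static void main(String[] args) {
    // Placeholder counts for illustration only.
    System.out.println(savedPercent(100, 76));
    System.out.println(savedPercent(1000, 1007));
  }
}
```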

      In some cases, when scanning many terms in lexicographical order, the optimization saves a lot. In other cases, when only looking up a few sparse terms, the optimization brings no improvement, but it does not penalize either. It seems worthwhile to always enable it.

              People

              • Assignee:
                Unassigned
                Reporter:
                bruno.roustant Bruno Roustant
              • Votes:
                2
                Watchers:
                8

                Dates

                • Created:
                  Updated:

                  Time Tracking

                  Estimated:
                  Not Specified
                  Remaining:
                  0h
                  Logged:
                  1h 10m