Lucene - Core / LUCENE-8836

Optimize DocValues TermsDict to continue scanning from the last position when possible

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • Fix Version/s: 9.2
    • None
    • New, Patch Available

    Description

      Lucene80DocValuesProducer.TermsDict is used to look up either a term or a term ordinal.

      Currently it lacks an optimization that FSTEnum has: the ability to continue a sequential scan from the position of the last lookup in the IndexInput. For sparse lookups (when searching for only a few terms or ordinals) this is not an issue. But for multiple lookups in a row, this optimization could avoid re-scanning all the terms from the block start (which is otherwise necessary because they are delta-encoded).

      This patch proposes the optimization.
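
The idea can be illustrated with a toy model. The sketch below is not the Lucene80DocValuesProducer.TermsDict code and is not taken from the patch: the class name, the block size, and the `seeks` / `termReads` counters are all hypothetical, and strings stand in for the bytes read from an IndexInput. It only shows why remembering the last decoded position helps when terms are prefix-compressed inside a block.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model only (hypothetical names); not the actual Lucene implementation. */
class ScanResumeSketch {
    static final int BLOCK_SIZE = 4;   // assumption for the demo, not Lucene's value

    private final int[] prefixLen;     // chars shared with the previous term (0 at a block start)
    private final String[] suffix;     // remaining chars of each term
    private final int size;

    private int curOrd = -1;           // ordinal of the last decoded term, -1 if none yet
    private String curTerm = "";       // last decoded term
    int seeks = 0;                     // jumps back to a block start
    int termReads = 0;                 // delta-encoded entries decoded

    ScanResumeSketch(List<String> sortedTerms) {
        size = sortedTerms.size();
        prefixLen = new int[size];
        suffix = new String[size];
        for (int i = 0; i < size; i++) {
            String t = sortedTerms.get(i);
            int p = 0;
            if (i % BLOCK_SIZE != 0) { // the first term of a block is stored in full
                String prev = sortedTerms.get(i - 1);
                int max = Math.min(prev.length(), t.length());
                while (p < max && prev.charAt(p) == t.charAt(p)) p++;
            }
            prefixLen[i] = p;
            suffix[i] = t.substring(p);
        }
    }

    /** Returns the term for ord; if resume is true, continues from the last position when possible. */
    String seekExact(int ord, boolean resume) {
        if (ord < 0 || ord >= size) throw new IndexOutOfBoundsException("ord=" + ord);
        int block = ord / BLOCK_SIZE;
        boolean canResume = resume && curOrd >= 0 && curOrd <= ord && curOrd / BLOCK_SIZE == block;
        if (!canResume) {              // fall back to re-scanning from the block start
            seeks++;
            curOrd = block * BLOCK_SIZE - 1;
            curTerm = "";
        }
        while (curOrd < ord) {         // decode forward, one delta-encoded term at a time
            curOrd++;
            curTerm = curTerm.substring(0, prefixLen[curOrd]) + suffix[curOrd];
            termReads++;
        }
        return curTerm;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
            "apple", "apply", "apricot", "banana", "band", "bandit", "candle", "candy");
        for (boolean resume : new boolean[] {false, true}) {
            ScanResumeSketch dict = new ScanResumeSketch(terms);
            for (int ord = 0; ord < terms.size(); ord++) dict.seekExact(ord, resume);
            System.out.println("resume=" + resume + " seeks=" + dict.seeks
                + " termReads=" + dict.termReads);
        }
    }
}
```

In this toy setup, looking up all 8 ordinals in order costs 8 seeks and 20 term reads without resuming, versus 2 seeks and 8 term reads with resuming. A lookup that targets an earlier ordinal or a different block simply falls back to the block start, which is why sparse lookups would be expected to see neither a gain nor a significant penalty.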

      To estimate the gain, we ran 3 Lucene tests while counting the seeks and the term reads in the IndexInput, with and without the optimization:

      TestLucene70DocValuesFormat - the optimization saves 24% of seeks and 15% of term reads.
      TestDocValuesQueries - the optimization adds 0.7% more seeks and 0.003% more term reads.
      TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% of seeks and 82% of term reads.

      In some cases, when scanning many terms in lexicographical order, the optimization saves a lot. In other cases, when looking up only a few sparse terms, the optimization brings no improvement but does not penalize either. It seems worthwhile to always have it.

            People

              Assignee: Unassigned
              Reporter: Bruno Roustant (broustant)
              Votes: 2
              Watchers: 12

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 40m