Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-9237

Faster TermsEnum intersect for UniformSplit

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.5
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      New version of TermsEnum intersect for UniformSplit. It is 75% more efficient than the previous version for FuzzyQuery.

      Compared to BlockTree IntersectTermsEnum:

      • It is still slower for FuzzyQuery (between -37% and -44% in our benchmarks) but it is faster than the previous version (which was -65%).
      • It is on par or slightly slower for WildcardQuery (between -5% and 0%).
      • It is slightly faster for PrefixQuery (between +5% and +10%).

       

      When I debugged thoroughly to understand what was the limitation of the previous approach we had (to compute the common prefix between two consecutive block keys in the FST), I saw that actually for all FuzzyQuery the common prefix matched so we entered all blocks.
      I realized that the FuzzyQuery automaton accepts many variations for the prefix, and the common prefix was not long enough to allow us to filter correctly.

      I looked at what VarGapFixedInterval did. It jumped all the time after each term to find the next target term accepted by the automaton. And this was sufficiently efficient thanks to a vital optimization that compared the target term to the immediate following term, to actually not jump most of the time.

      So I applied the same idea to compute the next accepted term and jump, but now with a first condition based on the number of consecutively rejected terms, and by anticipating the comparison of the accepted term with the immediate next term. This is the main factor of the improvement. We leverage also other optimizations that speed up the automaton validation of each sequential term in the block.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                broustant Bruno Roustant
                Reporter:
                broustant Bruno Roustant
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 4h 20m
                  4h 20m