Lucene - Core / LUCENE-4702

Terms dictionary compression


Details

    • Type: Wish
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.5
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      I ran a quick test with the block tree terms dictionary, replacing the call to IndexOutput.writeBytes that writes suffix bytes with a call to LZ4.compressHC, to measure the performance hit. Interestingly, search performance was very good (see comparison table below) and the .tim files were 14% smaller (from 150432 bytes overall to 129516).
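
      To make the idea concrete, here is a minimal, self-contained sketch of compressing a block of term suffix bytes before writing it and decompressing it on read. It is not the patch that was tested: the JDK does not ship LZ4, so java.util.zip.Deflater stands in for LZ4.compressHC, and the class and method names (SuffixBlockCompression, concatSuffixes, compress, decompress) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: DEFLATE is used as a stand-in for LZ4 high compression.
final class SuffixBlockCompression {

    /** Concatenates term suffixes into a single byte block (assumed block layout). */
    static byte[] concatSuffixes(String[] suffixes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String s : suffixes) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    /** Compresses the block instead of writing the raw suffix bytes. */
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Restores the block; the reader must know the uncompressed length. */
    static byte[] decompress(byte[] compressed, int rawLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] raw = new byte[rawLength];
        int off = 0;
        try {
            while (off < rawLength && !inflater.finished()) {
                int n = inflater.inflate(raw, off, rawLength - off);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // no more progress possible
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        if (off != rawLength) {
            throw new IllegalStateException("truncated block");
        }
        return raw;
    }

    public static void main(String[] args) {
        String[] suffixes = {"ing", "ings", "ingly", "ed", "er", "ers", "est"};
        byte[] raw = concatSuffixes(suffixes);
        byte[] packed = compress(raw);
        byte[] restored = decompress(packed, raw.length);
        System.out.println("raw bytes: " + raw.length + ", compressed bytes: " + packed.length);
        System.out.println("round trip ok: " + java.util.Arrays.equals(raw, restored));
    }
}
```

      The decompression step on every block read is the cost that terms-dictionary-intensive queries would pay; on very small blocks the compressed form can be larger than the input, so a real implementation would likely fall back to raw bytes in that case.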

                    Task    QPS baseline      StdDev  QPS compressed      StdDev                Pct diff
                        Fuzzy1      111.50      (2.0%)       78.78      (1.5%)  -29.4% ( -32% -  -26%)
                        Fuzzy2       36.99      (2.7%)       28.59      (1.5%)  -22.7% ( -26% -  -18%)
                       Respell      122.86      (2.1%)      103.89      (1.7%)  -15.4% ( -18% -  -11%)
                      Wildcard      100.58      (4.3%)       94.42      (3.2%)   -6.1% ( -13% -    1%)
                       Prefix3      124.90      (5.7%)      122.67      (4.7%)   -1.8% ( -11% -    9%)
                     OrHighLow      169.87      (6.8%)      167.77      (8.0%)   -1.2% ( -15% -   14%)
                       LowTerm     1949.85      (4.5%)     1929.02      (3.4%)   -1.1% (  -8% -    7%)
                    AndHighLow     2011.95      (3.5%)     1991.85      (3.3%)   -1.0% (  -7% -    5%)
                    OrHighHigh      155.63      (6.7%)      154.12      (7.9%)   -1.0% ( -14% -   14%)
                   AndHighHigh      341.82      (1.2%)      339.49      (1.7%)   -0.7% (  -3% -    2%)
                     OrHighMed      217.55      (6.3%)      216.16      (7.1%)   -0.6% ( -13% -   13%)
                        IntNRQ       53.10     (10.9%)       52.90      (8.6%)   -0.4% ( -17% -   21%)
                       MedTerm      998.11      (3.8%)      994.82      (5.6%)   -0.3% (  -9% -    9%)
                   MedSpanNear       60.50      (3.7%)       60.36      (4.8%)   -0.2% (  -8% -    8%)
                  HighSpanNear       19.74      (4.5%)       19.72      (5.1%)   -0.1% (  -9% -    9%)
                   LowSpanNear      101.93      (3.2%)      101.82      (4.4%)   -0.1% (  -7% -    7%)
                    AndHighMed      366.18      (1.7%)      366.93      (1.7%)    0.2% (  -3% -    3%)
                      PKLookup      237.28      (4.0%)      237.96      (4.2%)    0.3% (  -7% -    8%)
                     MedPhrase      173.17      (4.7%)      174.69      (4.7%)    0.9% (  -8% -   10%)
               LowSloppyPhrase      180.91      (2.6%)      182.79      (2.7%)    1.0% (  -4% -    6%)
                     LowPhrase      374.64      (5.5%)      379.11      (5.8%)    1.2% (  -9% -   13%)
                      HighTerm      253.14      (7.9%)      256.97     (11.4%)    1.5% ( -16% -   22%)
                    HighPhrase       19.52     (10.6%)       19.83     (11.0%)    1.6% ( -18% -   25%)
               MedSloppyPhrase      141.90      (2.6%)      144.11      (2.5%)    1.6% (  -3% -    6%)
              HighSloppyPhrase       25.26      (4.8%)       25.97      (5.0%)    2.8% (  -6% -   13%)
      

      Only queries that are very terms-dictionary-intensive took a performance hit (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, behaved (surprisingly) well.
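
      As a sanity check on the numbers, the relative change can be recomputed from the reported values. The sketch below assumes the "Pct diff" column is (compressed - baseline) / baseline; the class and method names are hypothetical, and results computed from the rounded QPS values shown in the table may differ from the reported column in the last digit.

```java
import java.util.Locale;

// Hypothetical helper for recomputing relative changes from the reported figures.
final class BenchmarkDiff {

    /** Relative change of candidate versus baseline, in percent. */
    static double pctDiff(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // QPS values copied from the Fuzzy2 and Respell rows of the table.
        System.out.printf(Locale.ROOT, "Fuzzy2:  %.1f%%%n", pctDiff(36.99, 28.59));
        System.out.printf(Locale.ROOT, "Respell: %.1f%%%n", pctDiff(122.86, 103.89));
        // Overall tim file sizes quoted in the description.
        System.out.printf(Locale.ROOT, "tim size: %.1f%%%n", pctDiff(150432, 129516));
    }
}
```

      The size change works out to roughly -13.9%, consistent with the "14% smaller" figure quoted above.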

      Do you think this is worth exploring?

          People

            Assignee: jpountz Adrien Grand
            Reporter: jpountz Adrien Grand
            Votes: 0
            Watchers: 9


              Time Tracking

              Original Estimate: Not Specified
              Remaining Estimate: 0h
              Time Spent: 3h 50m
