Lucene - Core / LUCENE-4702

Terms dictionary compression

Details

    • Type: Wish
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.5
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      I ran a quick test with the block tree terms dictionary, replacing the call to IndexOutput.writeBytes that writes suffix bytes with a call to LZ4.compressHC, to measure the performance hit. Interestingly, search performance was very good (see the comparison table below) and the .tim files were 14% smaller (from 150432 bytes overall to 129516).
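      As a rough illustration of the idea (compress the concatenated suffix bytes of a block instead of writing them raw), here is a minimal, self-contained sketch. It is not the attached patch: LZ4 is not part of the JDK, so java.util.zip.Deflater stands in for LZ4.compressHC, and the class and method names (SuffixBlockSketch, concatSuffixes, compress, decompress) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: Deflater is a stand-in for LZ4.compressHC.
final class SuffixBlockSketch {

    /** Concatenates the suffixes of one block into a single byte array. */
    static byte[] concatSuffixes(String... suffixes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String suffix : suffixes) {
            byte[] bytes = suffix.getBytes(StandardCharsets.UTF_8);
            out.write(bytes, 0, bytes.length);
        }
        return out.toByteArray();
    }

    /** Compresses the suffix bytes; the caller would write this instead of the raw bytes. */
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[256];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Restores the raw suffix bytes; rawLength must be stored alongside the block. */
    static byte[] decompress(byte[] compressed, int rawLength) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] raw = new byte[rawLength];
            int off = 0;
            while (off < rawLength) {
                int n = inflater.inflate(raw, off, rawLength - off);
                if (n == 0) {
                    throw new IllegalStateException("compressed block ended early");
                }
                off += n;
            }
            return raw;
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed block", e);
        } finally {
            inflater.end();
        }
    }
}
```

      The reader has to decompress a whole block before it can scan any suffix in it, which is one plausible reason why queries that visit many blocks would pay the most.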

                        Task  QPS baseline      StdDev  QPS compressed      StdDev                Pct diff
                        Fuzzy1      111.50      (2.0%)       78.78      (1.5%)  -29.4% ( -32% -  -26%)
                        Fuzzy2       36.99      (2.7%)       28.59      (1.5%)  -22.7% ( -26% -  -18%)
                       Respell      122.86      (2.1%)      103.89      (1.7%)  -15.4% ( -18% -  -11%)
                      Wildcard      100.58      (4.3%)       94.42      (3.2%)   -6.1% ( -13% -    1%)
                       Prefix3      124.90      (5.7%)      122.67      (4.7%)   -1.8% ( -11% -    9%)
                     OrHighLow      169.87      (6.8%)      167.77      (8.0%)   -1.2% ( -15% -   14%)
                       LowTerm     1949.85      (4.5%)     1929.02      (3.4%)   -1.1% (  -8% -    7%)
                    AndHighLow     2011.95      (3.5%)     1991.85      (3.3%)   -1.0% (  -7% -    5%)
                    OrHighHigh      155.63      (6.7%)      154.12      (7.9%)   -1.0% ( -14% -   14%)
                   AndHighHigh      341.82      (1.2%)      339.49      (1.7%)   -0.7% (  -3% -    2%)
                     OrHighMed      217.55      (6.3%)      216.16      (7.1%)   -0.6% ( -13% -   13%)
                        IntNRQ       53.10     (10.9%)       52.90      (8.6%)   -0.4% ( -17% -   21%)
                       MedTerm      998.11      (3.8%)      994.82      (5.6%)   -0.3% (  -9% -    9%)
                   MedSpanNear       60.50      (3.7%)       60.36      (4.8%)   -0.2% (  -8% -    8%)
                  HighSpanNear       19.74      (4.5%)       19.72      (5.1%)   -0.1% (  -9% -    9%)
                   LowSpanNear      101.93      (3.2%)      101.82      (4.4%)   -0.1% (  -7% -    7%)
                    AndHighMed      366.18      (1.7%)      366.93      (1.7%)    0.2% (  -3% -    3%)
                      PKLookup      237.28      (4.0%)      237.96      (4.2%)    0.3% (  -7% -    8%)
                     MedPhrase      173.17      (4.7%)      174.69      (4.7%)    0.9% (  -8% -   10%)
               LowSloppyPhrase      180.91      (2.6%)      182.79      (2.7%)    1.0% (  -4% -    6%)
                     LowPhrase      374.64      (5.5%)      379.11      (5.8%)    1.2% (  -9% -   13%)
                      HighTerm      253.14      (7.9%)      256.97     (11.4%)    1.5% ( -16% -   22%)
                    HighPhrase       19.52     (10.6%)       19.83     (11.0%)    1.6% ( -18% -   25%)
               MedSloppyPhrase      141.90      (2.6%)      144.11      (2.5%)    1.6% (  -3% -    6%)
              HighSloppyPhrase       25.26      (4.8%)       25.97      (5.0%)    2.8% (  -6% -   13%)
      

      Only queries that are very terms-dictionary-intensive took a performance hit (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, behaved (surprisingly) well.
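      For reference, the "Pct diff" column and the 14% size reduction can be reproduced as a plain relative change. This is a small sketch under the assumption that the column is (compressed - baseline) / baseline of the mean QPS; the helper name pctDiff is hypothetical. The benchmark presumably works from unrounded means, so recomputing from the rounded values in the table can differ in the last digit.

```java
// Hypothetical helper: relative change of a candidate value versus a baseline.
final class PctDiffSketch {

    /** Returns the change from baseline to candidate, in percent. */
    static double pctDiff(double baseline, double candidate) {
        return 100.0 * (candidate - baseline) / baseline;
    }

    public static void main(String[] args) {
        // Respell row: 122.86 QPS baseline, 103.89 QPS compressed.
        System.out.printf("Respell: %.1f%%%n", pctDiff(122.86, 103.89));
        // Overall terms dictionary size: 150432 bytes before, 129516 after.
        System.out.printf("Size: %.1f%%%n", pctDiff(150432, 129516));
    }
}
```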

      Do you think this is worth exploring?

      Attachments

        1. LUCENE-4702.patch
          6 kB
          Adrien Grand
        2. LUCENE-4702.patch
          5 kB
          Adrien Grand

          People

            Assignee: Adrien Grand (jpountz)
            Reporter: Adrien Grand (jpountz)
            Votes: 0
            Watchers: 9

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 3h 50m