[LUCENE-4702] Terms dictionary compression - ASF JIRA

Details

Type: Wish
Status: Closed
Priority: Trivial
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 8.5
Component/s: None
Labels: None
Lucene Fields: New

Description

I ran a quick test on the block tree terms dictionary, replacing the call to IndexOutput.writeBytes that writes suffix bytes with a call to LZ4.compressHC, to measure the performance hit. Interestingly, search performance held up well (see the comparison table below) and the .tim files were 14% smaller (150432 bytes overall, down to 129516).
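The change described above amounts to buffering a block's suffix bytes and writing them compressed instead of raw. A minimal sketch of that idea follows. It is not the attached patch: `java.util.zip.Deflater` stands in for `LZ4.compressHC` so the example runs on the JDK alone (DEFLATE is a different algorithm with different speed and ratio), and the class name, method names and sample suffixes are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class SuffixCompressionSketch {

    // Concatenate the suffix bytes of the terms in one block, the way a
    // terms dictionary block buffers them before writing (hypothetical helper).
    static byte[] concatSuffixes(String[] suffixes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String s : suffixes) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    // Stand-in for LZ4.compressHC: compress the buffered suffix bytes.
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Read side: the raw length must be stored next to the compressed bytes.
    static byte[] decompress(byte[] compressed, int rawLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] raw = new byte[rawLength];
        int off = 0;
        try {
            while (off < rawLength && !inflater.finished()) {
                int n = inflater.inflate(raw, off, rawLength - off);
                if (n == 0) {
                    break; // no progress: truncated or malformed input
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt suffix block", e);
        } finally {
            inflater.end();
        }
        return raw;
    }

    public static void main(String[] args) {
        // Made-up suffixes; real blocks hold the tails of terms sharing a prefix.
        String[] suffixes = new String[64];
        Arrays.fill(suffixes, "ization");
        byte[] raw = concatSuffixes(suffixes);
        byte[] packed = compress(raw);
        System.out.println("raw=" + raw.length + " bytes, compressed=" + packed.length + " bytes");
        System.out.println("round trip ok: " + Arrays.equals(raw, decompress(packed, raw.length)));
    }
}
```

The read side pays for decompression on every block it visits, which would be consistent with terms-dictionary-heavy queries being the ones that slow down.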

                                 QPS                     QPS
                    Task    baseline      StdDev  compressed      StdDev                Pct diff
                  Fuzzy1      111.50      (2.0%)       78.78      (1.5%)  -29.4% ( -32% -  -26%)
                  Fuzzy2       36.99      (2.7%)       28.59      (1.5%)  -22.7% ( -26% -  -18%)
                 Respell      122.86      (2.1%)      103.89      (1.7%)  -15.4% ( -18% -  -11%)
                Wildcard      100.58      (4.3%)       94.42      (3.2%)   -6.1% ( -13% -    1%)
                 Prefix3      124.90      (5.7%)      122.67      (4.7%)   -1.8% ( -11% -    9%)
               OrHighLow      169.87      (6.8%)      167.77      (8.0%)   -1.2% ( -15% -   14%)
                 LowTerm     1949.85      (4.5%)     1929.02      (3.4%)   -1.1% (  -8% -    7%)
              AndHighLow     2011.95      (3.5%)     1991.85      (3.3%)   -1.0% (  -7% -    5%)
              OrHighHigh      155.63      (6.7%)      154.12      (7.9%)   -1.0% ( -14% -   14%)
             AndHighHigh      341.82      (1.2%)      339.49      (1.7%)   -0.7% (  -3% -    2%)
               OrHighMed      217.55      (6.3%)      216.16      (7.1%)   -0.6% ( -13% -   13%)
                  IntNRQ       53.10     (10.9%)       52.90      (8.6%)   -0.4% ( -17% -   21%)
                 MedTerm      998.11      (3.8%)      994.82      (5.6%)   -0.3% (  -9% -    9%)
             MedSpanNear       60.50      (3.7%)       60.36      (4.8%)   -0.2% (  -8% -    8%)
            HighSpanNear       19.74      (4.5%)       19.72      (5.1%)   -0.1% (  -9% -    9%)
             LowSpanNear      101.93      (3.2%)      101.82      (4.4%)   -0.1% (  -7% -    7%)
              AndHighMed      366.18      (1.7%)      366.93      (1.7%)    0.2% (  -3% -    3%)
                PKLookup      237.28      (4.0%)      237.96      (4.2%)    0.3% (  -7% -    8%)
               MedPhrase      173.17      (4.7%)      174.69      (4.7%)    0.9% (  -8% -   10%)
         LowSloppyPhrase      180.91      (2.6%)      182.79      (2.7%)    1.0% (  -4% -    6%)
               LowPhrase      374.64      (5.5%)      379.11      (5.8%)    1.2% (  -9% -   13%)
                HighTerm      253.14      (7.9%)      256.97     (11.4%)    1.5% ( -16% -   22%)
              HighPhrase       19.52     (10.6%)       19.83     (11.0%)    1.6% ( -18% -   25%)
         MedSloppyPhrase      141.90      (2.6%)      144.11      (2.5%)    1.6% (  -3% -    6%)
        HighSloppyPhrase       25.26      (4.8%)       25.97      (5.0%)    2.8% (  -6% -   13%)
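As a sanity check on the Pct diff column, the figures can be approximately reproduced from the two QPS columns. The sketch below assumes the column is the relative change of the compressed run against the baseline; that formula is an assumption, not something stated in the issue, and results match the table only to within the rounding of the displayed QPS values. The class and method names are hypothetical. The same helper is applied to the .tim sizes quoted in the description.

```java
public class PctDiffSketch {

    // Relative change of candidate versus baseline, in percent
    // (assumed definition of the "Pct diff" column).
    static double pctDiff(double baseline, double candidate) {
        return 100.0 * (candidate - baseline) / baseline;
    }

    public static void main(String[] args) {
        // QPS values copied from the Fuzzy2 and Respell rows of the table.
        System.out.printf("Fuzzy2 QPS:  %.1f%%%n", pctDiff(36.99, 28.59));
        System.out.printf("Respell QPS: %.1f%%%n", pctDiff(122.86, 103.89));
        // Overall .tim sizes in bytes, from the description.
        System.out.printf(".tim size:   %.1f%%%n", pctDiff(150432, 129516));
    }
}
```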

Only the queries that are very terms-dictionary-intensive took a performance hit (Fuzzy1, Fuzzy2, Respell, Wildcard); other queries, including Prefix3, behaved (surprisingly) well.

Do you think this is worth exploring?

Attachments

LUCENE-4702.patch (5 kB), 21/Jan/13 13:29, Adrien Grand
LUCENE-4702.patch (6 kB), 29/Jan/13 12:02, Adrien Grand

Issue Links

links to

GitHub PR

GitHub Pull Request #1126

GitHub Pull Request #1216

People

Assignee: Adrien Grand

Reporter: Adrien Grand

Votes: 0

Watchers: 9

Dates

Created: 21/Jan/13 13:26

Updated: 28/Aug/22 13:36

Resolved: 30/Jan/20 09:57

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 3h 50m