Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Lucene42DVConsumer's ctor takes acceptableOverheadRatio, so that you can trade off time against space, and we pass PackedInts.FASTEST, so we always use 8 bits per value.

      But the class is package private, so if I want to make my own NormsFormat and pass e.g. PackedInts.COMPACT, I can't ... I think we should make this class public / @experimental?
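To make the time/space trade-off behind acceptableOverheadRatio concrete, here is a minimal, self-contained sketch. It is a simplified model, not Lucene's actual selection logic: the class name OverheadRatioSketch and both methods are hypothetical, it only considers byte-aligned widths (8, 16, 32, 64), and the ratios passed in are illustrative values rather than the PackedInts constants.

```java
// Simplified model of how an "acceptable overhead ratio" can steer the
// choice of bits per value. This is an illustration only and does not
// reproduce Lucene's PackedInts implementation.
final class OverheadRatioSketch {

    // Number of bits needed to represent values in [0, maxValue].
    static int bitsRequired(long maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("maxValue must be >= 0");
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    // Pick a byte-aligned width (fast to decode) if it fits within the
    // allowed overhead; otherwise fall back to the exact width (compact).
    static int chooseBitsPerValue(int bitsRequired, float acceptableOverheadRatio) {
        float ratio = Math.max(0f, acceptableOverheadRatio);
        int maxBits = bitsRequired + (int) (bitsRequired * ratio);
        for (int aligned : new int[] {8, 16, 32, 64}) {
            if (bitsRequired <= aligned && aligned <= maxBits) {
                return aligned;
            }
        }
        return bitsRequired;
    }
}
```

In this model, a large ratio rounds a 5-bit value up to a full byte, while a ratio of 0 keeps it at 5 bits. That is the kind of choice a configurable norms format would let callers make.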

        Activity

        Robert Muir added a comment -

        I made a patch for this actually; I'll find it or redo it.

        I think the good one to pick is actually PackedInts.DEFAULT or FAST (I forget which). For values < 255, one of these guarantees you get the fast 3-block one. It doesn't actually waste that much space, and it's much, much faster than COMPACT.

        Robert Muir added a comment -

        I meant PACKED_SINGLE_BLOCK.
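For readers unfamiliar with single-block packing, the idea can be sketched as follows: each 64-bit block holds a whole number of values, so no value straddles two blocks and a read is one shift plus one mask. This is an illustration under assumptions, not Lucene's implementation. The class SingleBlockPackedSketch and its methods are hypothetical names, and the sketch limits bitsPerValue to 1..32.

```java
// Illustrative single-block packed array: values never cross a 64-bit
// block boundary, trading a few wasted bits per block for simple reads.
final class SingleBlockPackedSketch {
    private final long[] blocks;
    private final int bitsPerValue;
    private final int valuesPerBlock;
    private final long mask;

    SingleBlockPackedSketch(int valueCount, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 32) {
            throw new IllegalArgumentException("bitsPerValue must be in 1..32");
        }
        this.bitsPerValue = bitsPerValue;
        this.valuesPerBlock = 64 / bitsPerValue;
        this.mask = (1L << bitsPerValue) - 1;
        // Round up so a partially filled last block is still allocated.
        this.blocks = new long[(valueCount + valuesPerBlock - 1) / valuesPerBlock];
    }

    void set(int index, long value) {
        int block = index / valuesPerBlock;
        int shift = (index % valuesPerBlock) * bitsPerValue;
        // Clear the slot, then write the new value into it.
        blocks[block] = (blocks[block] & ~(mask << shift)) | ((value & mask) << shift);
    }

    long get(int index) {
        int block = index / valuesPerBlock;
        int shift = (index % valuesPerBlock) * bitsPerValue;
        return (blocks[block] >>> shift) & mask;
    }

    // Bits left unused at the top of every block.
    int wastedBitsPerBlock() {
        return 64 % bitsPerValue;
    }

    int blockCount() {
        return blocks.length;
    }
}
```

With 5 bits per value, for example, this layout stores 12 values per block and leaves 4 bits unused. A fully compact layout avoids that waste, at the cost of values that may span two blocks.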

        Robert Muir added a comment -

        Untested patch. I think this way you can just do:

            new FilterCodec("MyCodec", new Lucene42Codec()) {
              @Override
              public NormsFormat normsFormat() {
                return new Lucene42NormsFormat(PackedInts.DEFAULT);
              }
            };
        

        I like it better than my previous idea of making Consumer/Producer public, because it exposes much less surface area and is easier to use...what do you think?

        Adrien Grand added a comment -

        > I like it better than my previous idea of making Consumer/Producer public, because it exposes much less surface area and is easier to use...what do you think?

        +1

        Adrien Grand added a comment -

        I ran the WIKI_MEDIUM_1M benchmark again with various norms formats, and Lucene42NormsFormat with PackedInts.DEFAULT doesn't look bad:

        Default norms format: 1991830 bytes of norms
        
        Lucene42NormsFormat(PackedInts.DEFAULT): 909910 bytes of norms
        
                            Task   QPS trunk      StdDev   QPS packed norms      StdDev                Pct diff
                        HighTerm      758.15      (6.4%)      643.01      (7.5%)  -15.2% ( -27% -   -1%)
                      OrHighHigh      296.86     (10.3%)      280.84     (10.6%)   -5.4% ( -23% -   17%)
                       OrHighMed      218.24     (10.7%)      209.35     (10.9%)   -4.1% ( -23% -   19%)
                          Fuzzy2      140.18      (4.0%)      135.14      (5.3%)   -3.6% ( -12% -    5%)
                         MedTerm     1578.99      (7.4%)     1546.60      (4.8%)   -2.1% ( -13% -   10%)
                      HighPhrase      160.42      (6.6%)      157.22      (4.0%)   -2.0% ( -11% -    9%)
                       OrHighLow      552.01      (9.9%)      543.15     (10.8%)   -1.6% ( -20% -   21%)
                        PKLookup      386.15      (5.4%)      382.35      (4.5%)   -1.0% ( -10% -    9%)
                     MedSpanNear      135.61      (3.5%)      134.41      (4.1%)   -0.9% (  -8% -    7%)
                    HighSpanNear       10.72      (3.2%)       10.63      (2.2%)   -0.8% (  -6% -    4%)
                HighSloppyPhrase       47.29      (4.3%)       47.09      (5.0%)   -0.4% (  -9% -    9%)
                     LowSpanNear       63.62      (3.4%)       63.83      (4.1%)    0.3% (  -6% -    8%)
                         Respell      117.48      (4.8%)      118.03      (4.2%)    0.5% (  -8% -    9%)
                        Wildcard      288.18      (4.0%)      289.88      (4.3%)    0.6% (  -7% -    9%)
                     AndHighHigh      478.72      (3.7%)      481.87      (3.2%)    0.7% (  -6% -    7%)
                         Prefix3     1399.57      (3.8%)     1410.64      (6.0%)    0.8% (  -8% -   10%)
                 MedSloppyPhrase      233.10      (3.8%)      235.37      (4.2%)    1.0% (  -6% -    9%)
                      AndHighMed      751.65      (3.7%)      759.12      (4.7%)    1.0% (  -7% -    9%)
                       MedPhrase      119.14      (5.2%)      120.52      (4.7%)    1.2% (  -8% -   11%)
                          Fuzzy1      142.29      (3.7%)      144.50      (4.5%)    1.6% (  -6% -   10%)
                      AndHighLow     2365.88      (6.6%)     2407.32      (4.7%)    1.8% (  -8% -   13%)
                       LowPhrase      256.84      (4.3%)      262.04      (2.6%)    2.0% (  -4% -    9%)
                 LowSloppyPhrase      313.62      (2.9%)      321.21      (3.5%)    2.4% (  -3% -    9%)
                          IntNRQ      117.27      (7.1%)      121.22     (11.0%)    3.4% ( -13% -   23%)
                         LowTerm     2760.64      (4.5%)     2907.64      (6.8%)    5.3% (  -5% -   17%)
        
        
        
        Lucene42NormsFormat(PackedInts.DEFAULT): 896406 bytes of norms
                            
                            Task   QPS trunk      StdDev   QPS packed norms      StdDev                Pct diff
                        HighTerm      698.74      (9.5%)      607.43      (8.0%)  -13.1% ( -27% -    4%)
                      OrHighHigh      247.01      (6.3%)      216.49      (5.8%)  -12.4% ( -23% -    0%)
                       OrHighMed      339.84      (6.1%)      301.83      (7.1%)  -11.2% ( -23% -    2%)
                       OrHighLow      385.26      (5.6%)      342.81      (7.5%)  -11.0% ( -22% -    2%)
                         MedTerm     1100.36     (10.0%)      983.30      (7.5%)  -10.6% ( -25% -    7%)
                      HighPhrase      181.74      (8.1%)      176.96      (5.9%)   -2.6% ( -15% -   12%)
                          Fuzzy1      157.29      (5.1%)      154.49      (4.7%)   -1.8% ( -10% -    8%)
                    HighSpanNear       34.67      (3.6%)       34.13      (2.5%)   -1.5% (  -7% -    4%)
                         Prefix3      437.45      (6.1%)      431.17      (6.0%)   -1.4% ( -12% -   11%)
                HighSloppyPhrase        5.96      (4.1%)        5.91      (2.7%)   -0.8% (  -7% -    6%)
                 MedSloppyPhrase      264.84      (4.2%)      262.92      (4.9%)   -0.7% (  -9% -    8%)
                         Respell      194.30      (5.8%)      192.95      (4.3%)   -0.7% ( -10% -    9%)
                       MedPhrase      132.99      (5.6%)      132.37      (5.2%)   -0.5% ( -10% -   10%)
                        Wildcard      235.47      (4.8%)      235.00      (4.5%)   -0.2% (  -9% -    9%)
                     AndHighHigh      338.04      (3.3%)      337.96      (2.4%)   -0.0% (  -5% -    5%)
                       LowPhrase      353.22      (6.9%)      353.80      (5.3%)    0.2% ( -11% -   13%)
                     LowSpanNear       79.68      (3.6%)       79.98      (4.5%)    0.4% (  -7% -    8%)
                          Fuzzy2       79.15      (6.6%)       79.49      (5.6%)    0.4% ( -11% -   13%)
                        PKLookup      387.23      (6.7%)      389.36      (4.5%)    0.5% ( -10% -   12%)
                 LowSloppyPhrase      649.88      (2.7%)      655.05      (4.2%)    0.8% (  -5% -    7%)
                          IntNRQ      191.57      (7.7%)      195.08      (9.8%)    1.8% ( -14% -   20%)
                      AndHighLow     2025.29      (7.1%)     2065.03      (6.4%)    2.0% ( -10% -   16%)
                     MedSpanNear      415.85      (4.5%)      426.71      (4.0%)    2.6% (  -5% -   11%)
                      AndHighMed      956.96      (5.4%)      990.30      (6.6%)    3.5% (  -8% -   16%)
                         LowTerm     2644.68      (7.4%)     2745.68      (8.1%)    3.8% ( -10% -   20%)
        
        DiskNormsFormat (same as DiskDVF but for norms): 896314 bytes of norms
        
                            Task   QPS trunk      StdDev   QPS packed norms      StdDev                Pct diff
                        HighTerm      359.42     (12.9%)      204.00      (2.5%)  -43.2% ( -51% -  -32%)
                      OrHighHigh      269.86      (7.4%)      177.72      (4.1%)  -34.1% ( -42% -  -24%)
                       OrHighLow      358.36      (8.1%)      238.59      (4.1%)  -33.4% ( -42% -  -23%)
                       OrHighMed      305.65      (8.6%)      207.21      (4.7%)  -32.2% ( -41% -  -20%)
                         MedTerm     1342.66      (9.2%)      913.30      (3.4%)  -32.0% ( -40% -  -21%)
                         LowTerm     2849.62     (10.9%)     2449.59      (5.4%)  -14.0% ( -27% -    2%)
                     AndHighHigh      278.22      (3.8%)      249.40      (2.4%)  -10.4% ( -15% -   -4%)
                      HighPhrase      141.20      (6.5%)      131.19      (4.3%)   -7.1% ( -16% -    3%)
                      AndHighMed      410.39      (3.5%)      399.99      (3.1%)   -2.5% (  -8% -    4%)
                    HighSpanNear       42.28      (2.7%)       41.21      (2.8%)   -2.5% (  -7% -    3%)
                      AndHighLow     1932.50      (8.4%)     1895.71      (8.0%)   -1.9% ( -16% -   15%)
                          Fuzzy1      171.83      (4.0%)      168.69      (4.3%)   -1.8% (  -9% -    6%)
                          Fuzzy2       47.29      (4.1%)       46.75      (3.1%)   -1.1% (  -7% -    6%)
                        Wildcard      441.76      (4.8%)      437.28      (4.8%)   -1.0% ( -10% -    8%)
                         Respell      133.99      (3.7%)      132.66      (2.8%)   -1.0% (  -7% -    5%)
                          IntNRQ      125.99      (8.7%)      125.24      (7.5%)   -0.6% ( -15% -   17%)
                     MedSpanNear      107.53      (3.2%)      107.04      (4.9%)   -0.5% (  -8% -    7%)
                         Prefix3      570.56      (4.7%)      568.06      (4.9%)   -0.4% (  -9% -    9%)
                 MedSloppyPhrase      247.61      (4.4%)      249.33      (3.6%)    0.7% (  -7% -    9%)
                       LowPhrase      223.67      (3.7%)      225.77      (3.9%)    0.9% (  -6% -    8%)
                HighSloppyPhrase       46.13      (4.8%)       46.68      (5.9%)    1.2% (  -9% -   12%)
                        PKLookup      381.14      (2.5%)      385.72      (4.3%)    1.2% (  -5% -    8%)
                     LowSpanNear      109.87      (3.6%)      111.83      (4.7%)    1.8% (  -6% -   10%)
                 LowSloppyPhrase      179.23      (3.3%)      184.36      (4.2%)    2.9% (  -4% -   10%)
                       MedPhrase      202.33      (3.0%)      208.91      (4.0%)    3.3% (  -3% -   10%)
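
As a reading aid for the tables above, the small sketch below shows how the "Pct diff" column and the relative norms size can be derived from the raw numbers. The class BenchmarkDeltaSketch and its method names are hypothetical and are not part of luceneutil. The inputs in the usage note are copied from the first table and from the byte counts reported above.

```java
// Helper arithmetic for reading the benchmark tables: relative QPS change
// and relative index size. Hypothetical names, illustration only.
final class BenchmarkDeltaSketch {

    // Percentage change of the candidate QPS relative to the baseline QPS.
    static double pctDiff(double baselineQps, double candidateQps) {
        return (candidateQps - baselineQps) / baselineQps * 100.0;
    }

    // Candidate size expressed as a percentage of the baseline size.
    static double sizePercent(long baselineBytes, long candidateBytes) {
        return (double) candidateBytes / baselineBytes * 100.0;
    }
}
```

For example, pctDiff(758.15, 643.01) reproduces the roughly -15.2% reported for HighTerm in the first table, and sizePercent(1991830, 909910) shows that the packed norms take a bit under half the space of the default format.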
        
        Michael McCandless added a comment -

        +1, looks awesome, thanks Rob and Adrien!

        Robert Muir added a comment -

        Thanks for benchmarking Adrien. As you know, the norms are a stupid hotspot because of how we score (we decode multiple times for the same document).

        One thing I was curious about too was the impact on BooleanScorer2 (versus BS1).
        I guess I really wish we fixed luceneutil to always display information for both in-order and out-of-order scoring.

        Robert Muir added a comment -

        Thanks Mike and Adrien!

        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues


          People

          • Assignee:
            Unassigned
            Reporter:
            Michael McCandless
          • Votes:
            0
            Watchers:
            3
