Lucene - Core
  1. Lucene - Core
  2. LUCENE-4764

Faster but more RAM/Disk consuming DocValuesFormat for facets

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2, 6.0
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The new default DV format for binary fields has much more
      RAM-efficient encoding of the address for each document ... but it's
      also a bit slower at decode time, which affects facets because we
      decode for every collected docID.

      1. LUCENE-4764.patch
        23 kB
        Michael McCandless
      2. LUCENE-4764.patch
        22 kB
        Shai Erera
      3. LUCENE-4764.patch
        24 kB
        Shai Erera
      4. LUCENE-4764.patch
        22 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        Initial dirty patch (lots of nocommits still):

        I added a FacetDocValuesFormat, which goes back to the
        more-RAM-consuming-but-faster-for-facets 4.0 format, and also hacked
        the FastCountingFacetsAggregator to directly decode from the full
        byte[], saving overhead of method-call and filling a BytesRef. It
        gets faster results than default (Lucene42) DVFormat:

        This is wikibig all 6.6M, 7 facet dims:

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                         LowTerm      110.44      (2.0%)      104.86      (1.0%)   -5.1% (  -7% -   -2%)
                          Fuzzy1       46.50      (2.6%)       44.83      (1.3%)   -3.6% (  -7% -    0%)
                     MedSpanNear       28.61      (2.9%)       27.91      (1.8%)   -2.5% (  -6% -    2%)
                         Respell       45.56      (4.0%)       44.71      (3.1%)   -1.9% (  -8% -    5%)
                          Fuzzy2       52.44      (3.6%)       51.69      (2.2%)   -1.4% (  -6% -    4%)
                       LowPhrase       21.30      (6.3%)       21.01      (6.0%)   -1.4% ( -12% -   11%)
                     LowSpanNear        8.37      (2.4%)        8.26      (3.3%)   -1.3% (  -6% -    4%)
                 MedSloppyPhrase       25.88      (2.4%)       25.73      (2.3%)   -0.6% (  -5% -    4%)
                      AndHighMed      105.02      (1.4%)      105.78      (1.0%)    0.7% (  -1% -    3%)
                 LowSloppyPhrase       20.32      (3.2%)       20.55      (3.5%)    1.1% (  -5% -    8%)
                    HighSpanNear        3.51      (2.4%)        3.56      (1.7%)    1.2% (  -2% -    5%)
                      HighPhrase       17.32     (10.1%)       17.56     (10.2%)    1.4% ( -17% -   24%)
                      AndHighLow      575.37      (3.9%)      583.69      (3.7%)    1.4% (  -5% -    9%)
                HighSloppyPhrase        0.92      (6.2%)        0.95      (6.8%)    2.4% (  -9% -   16%)
                     AndHighHigh       23.25      (1.4%)       24.54      (0.9%)    5.5% (   3% -    7%)
                       MedPhrase      110.00      (5.3%)      117.78      (6.1%)    7.1% (  -4% -   19%)
                        Wildcard       27.31      (2.1%)       32.28      (1.6%)   18.2% (  14% -   22%)
                         MedTerm       46.99      (2.7%)       57.33      (1.8%)   22.0% (  17% -   27%)
                       OrHighMed       16.38      (3.6%)       21.44      (3.2%)   30.9% (  23% -   39%)
                      OrHighHigh        8.63      (3.7%)       11.33      (3.6%)   31.3% (  23% -   39%)
                       OrHighLow       16.88      (3.5%)       22.21      (3.3%)   31.6% (  23% -   39%)
                         Prefix3       12.91      (2.9%)       17.29      (2.0%)   33.9% (  28% -   39%)
                        HighTerm       18.99      (2.8%)       25.99      (2.5%)   36.9% (  30% -   43%)
                          IntNRQ        3.54      (3.2%)        4.96      (2.2%)   40.0% (  33% -   46%)
        

        But it's also more Disk/RAM-consuming: trunk facet DVs take 61.2 MB
        while the patch takes 80.3 MB (31% more).

        Show
        Michael McCandless added a comment - Initial dirty patch (lots of nocommits still): I added a FacetDocValuesFormat, which goes back to the more-RAM-consuming-but-faster-for-facets 4.0 format, and also hacked the FastCountingFacetsAggregator to directly decode from the full byte[], saving overhead of method-call and filling a BytesRef. It gets faster results than default (Lucene42) DVFormat: This is wikibig all 6.6M, 7 facet dims: Task QPS base StdDev QPS comp StdDev Pct diff LowTerm 110.44 (2.0%) 104.86 (1.0%) -5.1% ( -7% - -2%) Fuzzy1 46.50 (2.6%) 44.83 (1.3%) -3.6% ( -7% - 0%) MedSpanNear 28.61 (2.9%) 27.91 (1.8%) -2.5% ( -6% - 2%) Respell 45.56 (4.0%) 44.71 (3.1%) -1.9% ( -8% - 5%) Fuzzy2 52.44 (3.6%) 51.69 (2.2%) -1.4% ( -6% - 4%) LowPhrase 21.30 (6.3%) 21.01 (6.0%) -1.4% ( -12% - 11%) LowSpanNear 8.37 (2.4%) 8.26 (3.3%) -1.3% ( -6% - 4%) MedSloppyPhrase 25.88 (2.4%) 25.73 (2.3%) -0.6% ( -5% - 4%) AndHighMed 105.02 (1.4%) 105.78 (1.0%) 0.7% ( -1% - 3%) LowSloppyPhrase 20.32 (3.2%) 20.55 (3.5%) 1.1% ( -5% - 8%) HighSpanNear 3.51 (2.4%) 3.56 (1.7%) 1.2% ( -2% - 5%) HighPhrase 17.32 (10.1%) 17.56 (10.2%) 1.4% ( -17% - 24%) AndHighLow 575.37 (3.9%) 583.69 (3.7%) 1.4% ( -5% - 9%) HighSloppyPhrase 0.92 (6.2%) 0.95 (6.8%) 2.4% ( -9% - 16%) AndHighHigh 23.25 (1.4%) 24.54 (0.9%) 5.5% ( 3% - 7%) MedPhrase 110.00 (5.3%) 117.78 (6.1%) 7.1% ( -4% - 19%) Wildcard 27.31 (2.1%) 32.28 (1.6%) 18.2% ( 14% - 22%) MedTerm 46.99 (2.7%) 57.33 (1.8%) 22.0% ( 17% - 27%) OrHighMed 16.38 (3.6%) 21.44 (3.2%) 30.9% ( 23% - 39%) OrHighHigh 8.63 (3.7%) 11.33 (3.6%) 31.3% ( 23% - 39%) OrHighLow 16.88 (3.5%) 22.21 (3.3%) 31.6% ( 23% - 39%) Prefix3 12.91 (2.9%) 17.29 (2.0%) 33.9% ( 28% - 39%) HighTerm 18.99 (2.8%) 25.99 (2.5%) 36.9% ( 30% - 43%) IntNRQ 3.54 (3.2%) 4.96 (2.2%) 40.0% ( 33% - 46%) But it's also more Disk/RAM-consuming: trunk facet DVs take 61.2 MB while the patch takes 80.3 MB (31% more).
        Hide
        Robert Muir added a comment -

        should it really write a byte[]? i wonder how it would perform if it wrote and kept in ram packed ints, since it knows whats in the byte[].

        it would just make a byte[] on the fly to satisfy merging etc but otherwise provide an int-based interface for facets?

        Show
        Robert Muir added a comment - should it really write a byte[]? i wonder how it would perform if it wrote and kept in ram packed ints, since it knows whats in the byte[]. it would just make a byte[] on the fly to satisfy merging etc but otherwise provide an int-based interface for facets?
        Hide
        Shai Erera added a comment -

        i wonder how it would perform if it wrote and kept in ram packed ints, since it knows whats in the byte[]

        We've tried that in the past. I don't remember on which issue we posted the results, but they were not compelling. I.e. what we tried is to keep the ints as int[] vs packed-ints. int[] performed (IIRC) 50% faster, while packed-int only ~6-10% faster. Also, their RAM footprint was very close. The problem is that packed-ints is only good if you know something about the numbers, i.e. their size, distribution etc. But with category ordinals, on this Wikipedia index, there's nothing "special" about them. Really every document keeps close to arbitrary integers between 1 - 2.2M ...

        If the following math holds – 25 ords per document (that's 100 bytes/doc) x 6.6M documents – that's going to be ~660MB (offsets not included). I suspect that packed-ints will consume approximately the same size (at least, per past results) but won't yield significantly better performance. Therefore if we want to cache anything at the int level, we should do an int[] caching aggregator.

        Mike, correct me if I'm wrong.

        Show
        Shai Erera added a comment - i wonder how it would perform if it wrote and kept in ram packed ints, since it knows whats in the byte[] We've tried that in the past. I don't remember on which issue we posted the results, but they were not compelling. I.e. what we tried is to keep the ints as int[] vs packed-ints. int[] performed (IIRC) 50% faster, while packed-int only ~6-10% faster. Also, their RAM footprint was very close. The problem is that packed-ints is only good if you know something about the numbers, i.e. their size, distribution etc. But with category ordinals, on this Wikipedia index, there's nothing "special" about them. Really every document keeps close to arbitrary integers between 1 - 2.2M ... If the following math holds – 25 ords per document (that's 100 bytes/doc) x 6.6M documents – that's going to be ~660MB (offsets not included). I suspect that packed-ints will consume approximately the same size (at least, per past results) but won't yield significantly better performance. Therefore if we want to cache anything at the int level, we should do an int[] caching aggregator. Mike, correct me if I'm wrong.
        Hide
        Robert Muir added a comment -

        The problem is that packed-ints is only good if you know something about the numbers, i.e. their size, distribution etc. But with category ordinals, on this Wikipedia index, there's nothing "special" about them. Really every document keeps close to arbitrary integers between 1 - 2.2M

        If the following math holds – 25 ords per document

        Right but i dont look at what its doing this way. Today the ords for the document are vint-deltas (or similar) within a byte[] right?

        So instead perhaps the codec could encode the "first ord" (minimum) for the doc in a simple int[] or whatever, but the additional deltas are all within a big packed stream or something like that.

        In all cases i like the idea of a specialized docvaluesformat for facets. it doesn't have to be one-sized-fits-all: it could have a number of strategies depending on whether someone had 5 ords/doc or 500 ords/doc for example, by examining the iterator once at index-time to decide.

        Show
        Robert Muir added a comment - The problem is that packed-ints is only good if you know something about the numbers, i.e. their size, distribution etc. But with category ordinals, on this Wikipedia index, there's nothing "special" about them. Really every document keeps close to arbitrary integers between 1 - 2.2M If the following math holds – 25 ords per document Right but i dont look at what its doing this way. Today the ords for the document are vint-deltas (or similar) within a byte[] right? So instead perhaps the codec could encode the "first ord" (minimum) for the doc in a simple int[] or whatever, but the additional deltas are all within a big packed stream or something like that. In all cases i like the idea of a specialized docvaluesformat for facets. it doesn't have to be one-sized-fits-all: it could have a number of strategies depending on whether someone had 5 ords/doc or 500 ords/doc for example, by examining the iterator once at index-time to decide.
        Hide
        Shai Erera added a comment -

        I think that it would actually be interesting to test only VInt, without dgap. Because the ords seem to be arbitrary, I'm not even sure what they buy us. Mike, can you try that? Index with a Sorting(Unique(VInt8)) and modify FastCountingFacetsAggregator to not do dgap? Would be interesting to see the effects on compression as well as speed. Dgap is something you want to do if you suspect that a document will have e.g. higher ordinals, that are close to each other in such a way that dgap would make them compress better ...

        Robert, if I understand your proposal correctly, what you suggest is to encode:

        int[] – pairs of highest/lowest ordinal in a document + length (#additional ords)
        byte[] – a packed-int of deltas for all documents (but deltas are computed off the absolute ord in the int[]

        Why would that be better than a single byte[] (packed-ints) + offsets?

        Show
        Shai Erera added a comment - I think that it would actually be interesting to test only VInt, without dgap. Because the ords seem to be arbitrary, I'm not even sure what they buy us. Mike, can you try that? Index with a Sorting(Unique(VInt8)) and modify FastCountingFacetsAggregator to not do dgap? Would be interesting to see the effects on compression as well as speed. Dgap is something you want to do if you suspect that a document will have e.g. higher ordinals, that are close to each other in such a way that dgap would make them compress better ... Robert, if I understand your proposal correctly, what you suggest is to encode: int[] – pairs of highest/lowest ordinal in a document + length (#additional ords) byte[] – a packed-int of deltas for all documents (but deltas are computed off the absolute ord in the int[] Why would that be better than a single byte[] (packed-ints) + offsets?
        Hide
        Shai Erera added a comment -

        I think that 30% more RAM is ok .. i.e. either you will have enough RAM on the machine, or those 30% won't make a big difference (for really large indexes). What bothers me is that there's no way to do that out-of-the-box ... not with how facets are indexed today. E.g., if facets were in core, then we could modify IWC to detect when facets are used (e.g. isEnableFacets) and then create the optimized Codec for them...

        And the problem is that unlike with a caching decision yes/no, here the situation is that facets are loaded into RAM by default, we just offer a better way to load them. I think that if we can find a justification to a FacetsCodec in general, then we could stuff such optimizations in and would tell users that if they want to index facets, they should work with that Codec...

        Or .. we can just leave it as-is and document somewhere that you might want to consider that DV format, at the expense of more RAM but faster search.

        Show
        Shai Erera added a comment - I think that 30% more RAM is ok .. i.e. either you will have enough RAM on the machine, or those 30% won't make a big difference (for really large indexes). What bothers me is that there's no way to do that out-of-the-box ... not with how facets are indexed today. E.g., if facets were in core, then we could modify IWC to detect when facets are used (e.g. isEnableFacets) and then create the optimized Codec for them... And the problem is that unlike with a caching decision yes/no, here the situation is that facets are loaded into RAM by default, we just offer a better way to load them. I think that if we can find a justification to a FacetsCodec in general, then we could stuff such optimizations in and would tell users that if they want to index facets, they should work with that Codec... Or .. we can just leave it as-is and document somewhere that you might want to consider that DV format, at the expense of more RAM but faster search.
        Hide
        Robert Muir added a comment -

        I think that 30% more RAM is ok .. i.e. either you will have enough RAM on the machine, or those 30% won't make a big difference (for really large indexes).

        This is misleading. Its not a constant 30%. The larger the index, the larger the cost.

        E.g., if facets were in core, then we could modify IWC to detect when facets are used (e.g. isEnableFacets) and then create the optimized Codec for them...

        I would be against this even if facets were in core. I think its ok to add this option, but i don't think its a good default. I think the benchmark being used here is easily misleading.

        Show
        Robert Muir added a comment - I think that 30% more RAM is ok .. i.e. either you will have enough RAM on the machine, or those 30% won't make a big difference (for really large indexes). This is misleading. Its not a constant 30%. The larger the index, the larger the cost. E.g., if facets were in core, then we could modify IWC to detect when facets are used (e.g. isEnableFacets) and then create the optimized Codec for them... I would be against this even if facets were in core. I think its ok to add this option, but i don't think its a good default. I think the benchmark being used here is easily misleading.
        Hide
        Robert Muir added a comment -

        And again by option: means users pick their codec the normal way (IndexWriterConfig.setCodec).

        We don't need to do any sneaky automatic codec-picking.

        Show
        Robert Muir added a comment - And again by option: means users pick their codec the normal way (IndexWriterConfig.setCodec). We don't need to do any sneaky automatic codec-picking.
        Hide
        Shai Erera added a comment -

        I don't think if we chose the Codec it's sneaky. It's just like Lucene defaults to Lucene42Codec. If, hypothetically (I don't think facets go into core anytime soon), facets were in core, then the default Codec decision could take isFacetsEnabled into consideration. That's all I was saying.

        Show
        Shai Erera added a comment - I don't think if we chose the Codec it's sneaky. It's just like Lucene defaults to Lucene42Codec. If, hypothetically (I don't think facets go into core anytime soon), facets were in core, then the default Codec decision could take isFacetsEnabled into consideration. That's all I was saying.
        Hide
        Michael McCandless added a comment -

        The 30% RAM overhead I measured was for the 7 dims per doc case.

        And the RAM overhead will vary greatly depending on the app: if you have fewer facet dims, and more docs, the overhead is higher.

        I think it should just be an optional DV Format that makes the obvious tradeoffs, and the javadocs should say "your mileage may vary", ie the time/space tradeoff will be app dependent.

        Show
        Michael McCandless added a comment - The 30% RAM overhead I measured was for the 7 dims per doc case. And the RAM overhead will vary greatly depending on the app: if you have fewer facet dims, and more docs, the overhead is higher. I think it should just be an optional DV Format that makes the obvious tradeoffs, and the javadocs should say "your mileage may vary", ie the time/space tradeoff will be app dependent.
        Hide
        Michael McCandless added a comment -

        I think that it would actually be interesting to test only VInt, without dgap. Because the ords seem to be arbitrary, I'm not even sure what they buy us. Mike, can you try that?

        No dgap compression, 1M docs, 7 dims per doc. Looks like we lost a bit:

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                         MedTerm      258.50      (1.5%)      252.69      (1.6%)   -2.2% (  -5% -    0%)
                       OrHighLow       55.96      (2.4%)       54.73      (2.0%)   -2.2% (  -6% -    2%)
                       OrHighMed       57.47      (2.4%)       56.33      (2.1%)   -2.0% (  -6% -    2%)
                      HighPhrase       44.47     (10.9%)       43.63     (10.7%)   -1.9% ( -21% -   22%)
                      OrHighHigh       38.53      (2.6%)       37.88      (2.3%)   -1.7% (  -6% -    3%)
                        HighTerm       65.49      (1.2%)       64.70      (1.9%)   -1.2% (  -4% -    1%)
                         Prefix3       46.82      (1.5%)       46.30      (1.2%)   -1.1% (  -3% -    1%)
                       MedPhrase      149.78      (5.5%)      148.17      (5.3%)   -1.1% ( -11% -   10%)
                     AndHighHigh       93.50      (1.0%)       92.73      (0.8%)   -0.8% (  -2% -    1%)
                HighSloppyPhrase        3.26      (6.8%)        3.24      (8.0%)   -0.8% ( -14% -   15%)
                    HighSpanNear       11.60      (1.7%)       11.51      (1.9%)   -0.8% (  -4% -    2%)
                       LowPhrase       73.57      (5.6%)       73.00      (5.0%)   -0.8% ( -10% -   10%)
                     LowSpanNear       43.68      (2.0%)       43.35      (2.3%)   -0.8% (  -4% -    3%)
                     MedSpanNear       90.77      (1.5%)       90.10      (1.4%)   -0.7% (  -3% -    2%)
                 LowSloppyPhrase       82.66      (1.9%)       82.13      (1.7%)   -0.6% (  -4% -    2%)
                 MedSloppyPhrase       92.12      (2.2%)       91.65      (2.2%)   -0.5% (  -4% -    3%)
                         LowTerm      466.62      (1.4%)      464.83      (1.9%)   -0.4% (  -3% -    2%)
                      AndHighMed      347.12      (1.7%)      348.61      (1.1%)    0.4% (  -2% -    3%)
                        Wildcard      120.82      (1.2%)      121.50      (1.6%)    0.6% (  -2% -    3%)
                          IntNRQ       23.40      (1.6%)       23.76      (1.4%)    1.5% (  -1% -    4%)
                          Fuzzy1       80.87      (2.4%)       82.38      (2.6%)    1.9% (  -3% -    7%)
                         Respell       71.83      (3.0%)       73.46      (3.2%)    2.3% (  -3% -    8%)
                      AndHighLow     1159.47      (3.8%)     1189.72      (2.4%)    2.6% (  -3% -    9%)
                          Fuzzy2       88.04      (3.0%)       91.48      (3.7%)    3.9% (  -2% -   10%)
        

        Trunk bytes for the DV facet field was 9219009, and no-dgap was
        10163419 (~10% larger). So net/net dGap seems to help!

        Show
        Michael McCandless added a comment - I think that it would actually be interesting to test only VInt, without dgap. Because the ords seem to be arbitrary, I'm not even sure what they buy us. Mike, can you try that? No dgap compression, 1M docs, 7 dims per doc. Looks like we lost a bit: Task QPS base StdDev QPS comp StdDev Pct diff MedTerm 258.50 (1.5%) 252.69 (1.6%) -2.2% ( -5% - 0%) OrHighLow 55.96 (2.4%) 54.73 (2.0%) -2.2% ( -6% - 2%) OrHighMed 57.47 (2.4%) 56.33 (2.1%) -2.0% ( -6% - 2%) HighPhrase 44.47 (10.9%) 43.63 (10.7%) -1.9% ( -21% - 22%) OrHighHigh 38.53 (2.6%) 37.88 (2.3%) -1.7% ( -6% - 3%) HighTerm 65.49 (1.2%) 64.70 (1.9%) -1.2% ( -4% - 1%) Prefix3 46.82 (1.5%) 46.30 (1.2%) -1.1% ( -3% - 1%) MedPhrase 149.78 (5.5%) 148.17 (5.3%) -1.1% ( -11% - 10%) AndHighHigh 93.50 (1.0%) 92.73 (0.8%) -0.8% ( -2% - 1%) HighSloppyPhrase 3.26 (6.8%) 3.24 (8.0%) -0.8% ( -14% - 15%) HighSpanNear 11.60 (1.7%) 11.51 (1.9%) -0.8% ( -4% - 2%) LowPhrase 73.57 (5.6%) 73.00 (5.0%) -0.8% ( -10% - 10%) LowSpanNear 43.68 (2.0%) 43.35 (2.3%) -0.8% ( -4% - 3%) MedSpanNear 90.77 (1.5%) 90.10 (1.4%) -0.7% ( -3% - 2%) LowSloppyPhrase 82.66 (1.9%) 82.13 (1.7%) -0.6% ( -4% - 2%) MedSloppyPhrase 92.12 (2.2%) 91.65 (2.2%) -0.5% ( -4% - 3%) LowTerm 466.62 (1.4%) 464.83 (1.9%) -0.4% ( -3% - 2%) AndHighMed 347.12 (1.7%) 348.61 (1.1%) 0.4% ( -2% - 3%) Wildcard 120.82 (1.2%) 121.50 (1.6%) 0.6% ( -2% - 3%) IntNRQ 23.40 (1.6%) 23.76 (1.4%) 1.5% ( -1% - 4%) Fuzzy1 80.87 (2.4%) 82.38 (2.6%) 1.9% ( -3% - 7%) Respell 71.83 (3.0%) 73.46 (3.2%) 2.3% ( -3% - 8%) AndHighLow 1159.47 (3.8%) 1189.72 (2.4%) 2.6% ( -3% - 9%) Fuzzy2 88.04 (3.0%) 91.48 (3.7%) 3.9% ( -2% - 10%) Trunk bytes for the DV facet field was 9219009, and no-dgap was 10163419 (~10% larger). So net/net dGap seems to help!
        Hide
        Shai Erera added a comment -

        Thanks for measuring Mike. It's good to know dgap helps!

        Show
        Shai Erera added a comment - Thanks for measuring Mike. It's good to know dgap helps!
        Hide
        Michael McCandless added a comment -

        I re-tested trunk vs this new DV format, with all 9 dims on the full 6.6M wikibig index. (The added 2 dims, username and categories, have many many unique values):

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                      HighPhrase       13.68      (8.1%)       13.64      (8.4%)   -0.3% ( -15% -   17%)
                       LowPhrase       15.05      (4.4%)       15.08      (4.4%)    0.1% (  -8% -    9%)
                     LowSpanNear        7.12      (2.5%)        7.17      (2.3%)    0.6% (  -4% -    5%)
                      AndHighLow       64.03      (1.3%)       64.55      (1.3%)    0.8% (  -1% -    3%)
                HighSloppyPhrase        0.82      (5.7%)        0.83      (4.8%)    1.1% (  -8% -   12%)
                         Respell       44.90      (4.0%)       45.43      (4.3%)    1.2% (  -6% -    9%)
                 LowSloppyPhrase       15.37      (2.1%)       15.57      (1.8%)    1.3% (  -2% -    5%)
                    HighSpanNear        2.91      (1.8%)        2.95      (1.9%)    1.3% (  -2% -    5%)
                          Fuzzy2       28.55      (2.0%)       29.02      (2.1%)    1.7% (  -2% -    5%)
                 MedSloppyPhrase       16.56      (1.2%)       16.94      (1.2%)    2.3% (   0% -    4%)
                      AndHighMed       39.47      (0.8%)       40.40      (1.0%)    2.4% (   0% -    4%)
                          Fuzzy1       24.08      (1.3%)       24.73      (1.4%)    2.7% (   0% -    5%)
                     MedSpanNear       17.70      (1.6%)       18.19      (1.6%)    2.8% (   0% -    6%)
                       MedPhrase       41.06      (2.2%)       42.46      (2.6%)    3.4% (  -1% -    8%)
                         LowTerm       34.19      (0.9%)       35.69      (1.0%)    4.4% (   2% -    6%)
                     AndHighHigh       11.92      (1.2%)       12.50      (1.1%)    4.9% (   2% -    7%)
                        Wildcard       13.13      (1.8%)       14.43      (1.5%)    9.9% (   6% -   13%)
                       OrHighMed        7.09      (2.7%)        7.85      (1.6%)   10.8% (   6% -   15%)
                       OrHighLow        7.16      (2.3%)        7.93      (1.6%)   10.8% (   6% -   15%)
                        HighTerm        7.59      (2.3%)        8.47      (1.6%)   11.5% (   7% -   15%)
                         MedTerm       20.14      (1.9%)       22.82      (1.1%)   13.3% (  10% -   16%)
                         Prefix3        5.78      (2.2%)        6.56      (1.5%)   13.4% (   9% -   17%)
                      OrHighHigh        4.03      (2.3%)        4.65      (2.0%)   15.4% (  10% -   20%)
                          IntNRQ        1.92      (2.2%)        2.45      (1.9%)   27.5% (  22% -   32%)
        

        145.3 MB for the new DV vs 129.0 MB for trunk = ~12.6% bigger.

        Show
        Michael McCandless added a comment - I re-tested trunk vs this new DV format, with all 9 dims on the full 6.6M wikibig index. (The added 2 dims, username and categories, have many many unique values): Task QPS base StdDev QPS comp StdDev Pct diff HighPhrase 13.68 (8.1%) 13.64 (8.4%) -0.3% ( -15% - 17%) LowPhrase 15.05 (4.4%) 15.08 (4.4%) 0.1% ( -8% - 9%) LowSpanNear 7.12 (2.5%) 7.17 (2.3%) 0.6% ( -4% - 5%) AndHighLow 64.03 (1.3%) 64.55 (1.3%) 0.8% ( -1% - 3%) HighSloppyPhrase 0.82 (5.7%) 0.83 (4.8%) 1.1% ( -8% - 12%) Respell 44.90 (4.0%) 45.43 (4.3%) 1.2% ( -6% - 9%) LowSloppyPhrase 15.37 (2.1%) 15.57 (1.8%) 1.3% ( -2% - 5%) HighSpanNear 2.91 (1.8%) 2.95 (1.9%) 1.3% ( -2% - 5%) Fuzzy2 28.55 (2.0%) 29.02 (2.1%) 1.7% ( -2% - 5%) MedSloppyPhrase 16.56 (1.2%) 16.94 (1.2%) 2.3% ( 0% - 4%) AndHighMed 39.47 (0.8%) 40.40 (1.0%) 2.4% ( 0% - 4%) Fuzzy1 24.08 (1.3%) 24.73 (1.4%) 2.7% ( 0% - 5%) MedSpanNear 17.70 (1.6%) 18.19 (1.6%) 2.8% ( 0% - 6%) MedPhrase 41.06 (2.2%) 42.46 (2.6%) 3.4% ( -1% - 8%) LowTerm 34.19 (0.9%) 35.69 (1.0%) 4.4% ( 2% - 6%) AndHighHigh 11.92 (1.2%) 12.50 (1.1%) 4.9% ( 2% - 7%) Wildcard 13.13 (1.8%) 14.43 (1.5%) 9.9% ( 6% - 13%) OrHighMed 7.09 (2.7%) 7.85 (1.6%) 10.8% ( 6% - 15%) OrHighLow 7.16 (2.3%) 7.93 (1.6%) 10.8% ( 6% - 15%) HighTerm 7.59 (2.3%) 8.47 (1.6%) 11.5% ( 7% - 15%) MedTerm 20.14 (1.9%) 22.82 (1.1%) 13.3% ( 10% - 16%) Prefix3 5.78 (2.2%) 6.56 (1.5%) 13.4% ( 9% - 17%) OrHighHigh 4.03 (2.3%) 4.65 (2.0%) 15.4% ( 10% - 20%) IntNRQ 1.92 (2.2%) 2.45 (1.9%) 27.5% ( 22% - 32%) 145.3 MB for the new DV vs 129.0 MB for trunk = ~12.6% bigger.
        Hide
        Shai Erera added a comment -

        Facets42Codec has a nocommit about handling multiple category lists as well as if the default field has changed. Currently (in the patch), it hard-codes to "$facets", but that won't work if e.g. the app indexed categories into a different field.

        Talking with Mike about it yesterday, I thought that what needs to be done is for the codec to receive the FacetIndexingParams, build a HashSet<String> of all fields that hold facets, and then use it in .getDocValuesFormatForField.

        However, I realized later that this is not doable, since Codecs must have a default constructor, and b/c of how they are initialized, they cannot rely on stuff passed to them in the ctor (e.g. when they are initialized by a reader?). Is that true? I looked at few Codecs impl, and looks like none relies on stuff passed to it in the ctor.

        If so, perhaps we should also override the FieldInfosFormat and use it to detect which fields are "facet" fields? E.g. it will be a subset of all fields that have BinaryDV. But that's not distinguishing enough ... and we cannot add a DVType, so cannot distinguish BINARY from FACETS_BINARY even if we wanted to make a different BinaryDV extension ...

        Crazy, but can we write a boolean to FieldInfo hasFacets? Is it supported if we e.g. extend (I realize, many) classes?

        Show
        Shai Erera added a comment - Facets42Codec has a nocommit about handling multiple category lists as well as if the default field has changed. Currently (in the patch), it hard-codes to "$facets", but that won't work if e.g. the app indexed categories into a different field. Talking with Mike about it yesterday, I thought that what needs to be done is for the codec to receive the FacetIndexingParams, build a HashSet<String> of all fields that hold facets, and then use it in .getDocValuesFormatForField. However, I realized later that this is not doable, since Codecs must have a default constructor, and b/c of how they are initialized, they cannot rely on stuff passed to them in the ctor (e.g. when they are initialized by a reader?). Is that true? I looked at few Codecs impl, and looks like none relies on stuff passed to it in the ctor. If so, perhaps we should also override the FieldInfosFormat and use it to detect which fields are "facet" fields? E.g. it will be a subset of all fields that have BinaryDV. But that's not distinguishing enough ... and we cannot add a DVType, so cannot distinguish BINARY from FACETS_BINARY even if we wanted to make a different BinaryDV extension ... Crazy, but can we write a boolean to FieldInfo hasFacets ? Is it supported if we e.g. extend (I realize, many) classes?
        Hide
        Michael McCandless added a comment -

        I decided to test whether the specialization (checking if DV format is
        FacetDVFormat and "directly" accessing its address/bytes) helps:

        Base = new DV format; comp = new DV format + spec, 9 dims:

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                     LowSpanNear        7.15      (2.3%)        7.14      (2.0%)   -0.1% (  -4% -    4%)
                         Respell       45.60      (3.4%)       45.64      (3.3%)    0.1% (  -6% -    7%)
                          Fuzzy1       24.79      (1.4%)       24.85      (1.3%)    0.3% (  -2% -    2%)
                     MedSpanNear       18.07      (1.3%)       18.12      (1.6%)    0.3% (  -2% -    3%)
                      AndHighMed       40.34      (0.8%)       40.47      (0.9%)    0.3% (  -1% -    2%)
                       MedPhrase       42.25      (2.9%)       42.40      (2.7%)    0.4% (  -5% -    6%)
                         LowTerm       35.62      (1.1%)       35.76      (1.3%)    0.4% (  -2% -    2%)
                      AndHighLow       64.53      (1.7%)       64.78      (1.2%)    0.4% (  -2% -    3%)
                          Fuzzy2       29.06      (1.6%)       29.19      (1.7%)    0.4% (  -2% -    3%)
                 MedSloppyPhrase       16.88      (1.1%)       16.97      (1.5%)    0.5% (  -2% -    3%)
                       LowPhrase       15.01      (4.7%)       15.09      (4.8%)    0.5% (  -8% -   10%)
                    HighSpanNear        2.92      (1.9%)        2.94      (1.7%)    0.7% (  -2% -    4%)
                 LowSloppyPhrase       15.48      (1.6%)       15.60      (2.1%)    0.7% (  -2% -    4%)
                      HighPhrase       13.50      (8.8%)       13.60      (8.6%)    0.7% ( -15% -   19%)
                         MedTerm       22.64      (1.1%)       22.91      (1.2%)    1.2% (  -1% -    3%)
                        Wildcard       14.29      (0.9%)       14.47      (1.4%)    1.3% (   0% -    3%)
                     AndHighHigh       12.40      (0.9%)       12.56      (1.2%)    1.3% (   0% -    3%)
                HighSloppyPhrase        0.82      (4.3%)        0.83      (5.2%)    1.9% (  -7% -   11%)
                       OrHighMed        7.74      (1.3%)        7.90      (1.4%)    2.0% (   0% -    4%)
                       OrHighLow        7.82      (1.4%)        7.98      (1.7%)    2.0% (   0% -    5%)
                        HighTerm        8.35      (1.1%)        8.52      (1.5%)    2.1% (   0% -    4%)
                         Prefix3        6.48      (1.1%)        6.62      (1.1%)    2.3% (   0% -    4%)
                      OrHighHigh        4.58      (1.6%)        4.69      (1.5%)    2.3% (   0% -    5%)
                          IntNRQ        2.41      (1.6%)        2.48      (1.5%)    2.7% (   0% -    5%)
        

        Same, but w/ 7 dims:

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                     MedSpanNear       28.73      (1.9%)       28.33      (2.7%)   -1.4% (  -5% -    3%)
                         Respell       45.08      (4.7%)       44.73      (4.0%)   -0.8% (  -9% -    8%)
                     LowSpanNear        8.38      (2.6%)        8.33      (2.5%)   -0.6% (  -5% -    4%)
                          Fuzzy2       52.13      (3.5%)       51.85      (3.5%)   -0.5% (  -7% -    6%)
                    HighSpanNear        3.53      (1.7%)        3.51      (1.9%)   -0.5% (  -3% -    3%)
                          Fuzzy1       46.42      (2.5%)       46.29      (2.3%)   -0.3% (  -4% -    4%)
                       MedPhrase      109.24      (5.5%)      109.16      (5.9%)   -0.1% ( -10% -   11%)
                      HighPhrase       17.28     (10.4%)       17.28     (10.6%)    0.0% ( -19% -   23%)
                HighSloppyPhrase        0.92      (8.0%)        0.92      (5.9%)    0.0% ( -12% -   15%)
                     AndHighHigh       23.28      (1.2%)       23.29      (0.8%)    0.0% (  -1% -    2%)
                       LowPhrase       21.08      (6.1%)       21.10      (6.6%)    0.1% ( -11% -   13%)
                      AndHighLow      586.97      (2.5%)      587.46      (2.3%)    0.1% (  -4% -    5%)
                 LowSloppyPhrase       20.38      (3.1%)       20.41      (2.6%)    0.1% (  -5% -    6%)
                         LowTerm      110.38      (2.0%)      110.52      (1.4%)    0.1% (  -3% -    3%)
                      AndHighMed      105.08      (1.0%)      105.31      (0.9%)    0.2% (  -1% -    2%)
                        Wildcard       27.23      (2.5%)       27.30      (1.8%)    0.3% (  -3% -    4%)
                 MedSloppyPhrase       25.94      (3.2%)       26.04      (2.1%)    0.4% (  -4% -    5%)
                          IntNRQ        3.52      (3.6%)        3.54      (2.6%)    0.6% (  -5% -    7%)
                        HighTerm       19.05      (3.3%)       19.18      (2.7%)    0.6% (  -5% -    6%)
                         Prefix3       12.89      (3.3%)       12.97      (2.3%)    0.7% (  -4% -    6%)
                         MedTerm       46.70      (3.0%)       47.06      (2.6%)    0.8% (  -4% -    6%)
                       OrHighLow       17.06      (4.2%)       17.22      (3.5%)    1.0% (  -6% -    9%)
                       OrHighMed       16.54      (4.2%)       16.71      (3.6%)    1.0% (  -6% -    9%)
                      OrHighHigh        8.72      (4.4%)        8.83      (3.7%)    1.2% (  -6% -    9%)
        

        So net/net the specialization doesn't help much here...

        Show
        Michael McCandless added a comment - I decided to test whether the specialization (checking if DV format is FacetDVFormat and "directly" accessing its address/bytes) helps: Base = new DV format; comp = new DV format + spec, 9 dims: Task QPS base StdDev QPS comp StdDev Pct diff LowSpanNear 7.15 (2.3%) 7.14 (2.0%) -0.1% ( -4% - 4%) Respell 45.60 (3.4%) 45.64 (3.3%) 0.1% ( -6% - 7%) Fuzzy1 24.79 (1.4%) 24.85 (1.3%) 0.3% ( -2% - 2%) MedSpanNear 18.07 (1.3%) 18.12 (1.6%) 0.3% ( -2% - 3%) AndHighMed 40.34 (0.8%) 40.47 (0.9%) 0.3% ( -1% - 2%) MedPhrase 42.25 (2.9%) 42.40 (2.7%) 0.4% ( -5% - 6%) LowTerm 35.62 (1.1%) 35.76 (1.3%) 0.4% ( -2% - 2%) AndHighLow 64.53 (1.7%) 64.78 (1.2%) 0.4% ( -2% - 3%) Fuzzy2 29.06 (1.6%) 29.19 (1.7%) 0.4% ( -2% - 3%) MedSloppyPhrase 16.88 (1.1%) 16.97 (1.5%) 0.5% ( -2% - 3%) LowPhrase 15.01 (4.7%) 15.09 (4.8%) 0.5% ( -8% - 10%) HighSpanNear 2.92 (1.9%) 2.94 (1.7%) 0.7% ( -2% - 4%) LowSloppyPhrase 15.48 (1.6%) 15.60 (2.1%) 0.7% ( -2% - 4%) HighPhrase 13.50 (8.8%) 13.60 (8.6%) 0.7% ( -15% - 19%) MedTerm 22.64 (1.1%) 22.91 (1.2%) 1.2% ( -1% - 3%) Wildcard 14.29 (0.9%) 14.47 (1.4%) 1.3% ( 0% - 3%) AndHighHigh 12.40 (0.9%) 12.56 (1.2%) 1.3% ( 0% - 3%) HighSloppyPhrase 0.82 (4.3%) 0.83 (5.2%) 1.9% ( -7% - 11%) OrHighMed 7.74 (1.3%) 7.90 (1.4%) 2.0% ( 0% - 4%) OrHighLow 7.82 (1.4%) 7.98 (1.7%) 2.0% ( 0% - 5%) HighTerm 8.35 (1.1%) 8.52 (1.5%) 2.1% ( 0% - 4%) Prefix3 6.48 (1.1%) 6.62 (1.1%) 2.3% ( 0% - 4%) OrHighHigh 4.58 (1.6%) 4.69 (1.5%) 2.3% ( 0% - 5%) IntNRQ 2.41 (1.6%) 2.48 (1.5%) 2.7% ( 0% - 5%) Same, but w/ 7 dims: Task QPS base StdDev QPS comp StdDev Pct diff MedSpanNear 28.73 (1.9%) 28.33 (2.7%) -1.4% ( -5% - 3%) Respell 45.08 (4.7%) 44.73 (4.0%) -0.8% ( -9% - 8%) LowSpanNear 8.38 (2.6%) 8.33 (2.5%) -0.6% ( -5% - 4%) Fuzzy2 52.13 (3.5%) 51.85 (3.5%) -0.5% ( -7% - 6%) HighSpanNear 3.53 (1.7%) 3.51 (1.9%) -0.5% ( -3% - 3%) Fuzzy1 46.42 (2.5%) 46.29 (2.3%) -0.3% ( -4% - 4%) MedPhrase 109.24 (5.5%) 109.16 (5.9%) -0.1% ( -10% - 11%) HighPhrase 17.28 (10.4%) 17.28 (10.6%) 0.0% ( -19% - 23%) HighSloppyPhrase 0.92 (8.0%) 0.92 (5.9%) 0.0% ( -12% - 15%) AndHighHigh 23.28 (1.2%) 23.29 (0.8%) 0.0% ( -1% - 2%) LowPhrase 21.08 (6.1%) 21.10 (6.6%) 0.1% ( -11% - 13%) AndHighLow 586.97 (2.5%) 587.46 (2.3%) 0.1% ( -4% - 5%) LowSloppyPhrase 20.38 (3.1%) 20.41 (2.6%) 0.1% ( -5% - 6%) LowTerm 110.38 (2.0%) 110.52 (1.4%) 0.1% ( -3% - 3%) AndHighMed 105.08 (1.0%) 105.31 (0.9%) 0.2% ( -1% - 2%) Wildcard 27.23 (2.5%) 27.30 (1.8%) 0.3% ( -3% - 4%) MedSloppyPhrase 25.94 (3.2%) 26.04 (2.1%) 0.4% ( -4% - 5%) IntNRQ 3.52 (3.6%) 3.54 (2.6%) 0.6% ( -5% - 7%) HighTerm 19.05 (3.3%) 19.18 (2.7%) 0.6% ( -5% - 6%) Prefix3 12.89 (3.3%) 12.97 (2.3%) 0.7% ( -4% - 6%) MedTerm 46.70 (3.0%) 47.06 (2.6%) 0.8% ( -4% - 6%) OrHighLow 17.06 (4.2%) 17.22 (3.5%) 1.0% ( -6% - 9%) OrHighMed 16.54 (4.2%) 16.71 (3.6%) 1.0% ( -6% - 9%) OrHighHigh 8.72 (4.4%) 8.83 (3.7%) 1.2% ( -6% - 9%) So net/net the specialization doesn't help much here...
        Hide
        Michael McCandless added a comment -

        However, I realized later that this is not doable, since Codecs must have a default constructor, and b/c of how they are initialized, they cannot rely on stuff passed to them in the ctor (e.g. when they are initialized by a reader?).

        Actually this is a non-issue: the PerFieldDVFormat writes which format was used for which field, and uses that at read-time. The index is "self-contained" and you don't need any special logic at read-time ...

        Show
        Michael McCandless added a comment - However, I realized later that this is not doable, since Codecs must have a default constructor, and b/c of how they are initialized, they cannot rely on stuff passed to them in the ctor (e.g. when they are initialized by a reader?). Actually this is a non-issue: the PerFieldDVFormat writes which format was used for which field, and uses that at read-time. The index is "self-contained" and you don't need any special logic at read-time ...
        Hide
        Shai Erera added a comment -

        Thanks for testing mike. So I guess FastCountingFA can go back to not specialize on this, and this remains a faster DV format only?

        And thanks for clarifying how those formats work. So I think that Facets42Codec can take a FIP in the ctor.

        Show
        Shai Erera added a comment - Thanks for testing mike. So I guess FastCountingFA can go back to not specialize on this, and this remains a faster DV format only? And thanks for clarifying how those formats work. So I think that Facets42Codec can take a FIP in the ctor.
        Hide
        Michael McCandless added a comment -

        So I guess FastCountingFA can go back to not specialize on this, and this remains a faster DV format only?

        We can do that, I'm fine with it: the gains are minor.

        Show
        Michael McCandless added a comment - So I guess FastCountingFA can go back to not specialize on this, and this remains a faster DV format only? We can do that, I'm fine with it: the gains are minor.
        Hide
        Shai Erera added a comment -

        Patch handles some nocommits. Now Facet42Codec takes a FacetIndexingParams and builds a HashSet over the fields returned by fip.getAllCLPs(), and uses it in getDVFForField.

        Also, this codec cannot support facet partitions, since the number of partitions is unknown in advance (each partition corresponds to a field). So I throw an IllegalArgEx.

        I renamed the package to o.a.l.facet.codecs.facet42 and moved everything under it. Facet42Codec and Facet42DVF are the only public classes.

        I also added a resources folder which declares the new DVF.

        Made FacetTestCase randomly select the new codec (30% of the times), all tests pass. Note though that only tests that use the default FacetIndexingParams actually test the new format.

        There are still few nocommits. I ran 'documetnation-lint' and it was happy.

        Show
        Shai Erera added a comment - Patch handles some nocommits. Now Facet42Codec takes a FacetIndexingParams and builds a HashSet over the fields returned by fip.getAllCLPs(), and uses it in getDVFForField. Also, this codec cannot support facet partitions, since the number of partitions is unknown in advance (each partition corresponds to a field). So I throw an IllegalArgEx. I renamed the package to o.a.l.facet.codecs.facet42 and moved everything under it. Facet42Codec and Facet42DVF are the only public classes. I also added a resources folder which declares the new DVF. Made FacetTestCase randomly select the new codec (30% of the times), all tests pass. Note though that only tests that use the default FacetIndexingParams actually test the new format. There are still few nocommits. I ran 'documetnation-lint' and it was happy.
        Hide
        Shai Erera added a comment -

        Oh, I see that the changes to FacetTestBase can be reverted, as now FacetTestCase picks that codec at random.

        Show
        Shai Erera added a comment - Oh, I see that the changes to FacetTestBase can be reverted, as now FacetTestCase picks that codec at random.
        Hide
        Robert Muir added a comment -

        I don't think that Facet42Codec should be final

        this way it can still be tweaked in the normal ways someone would tweak the default codec, overriding
        getPostings/DocValuesFormatForField, e.g.:

        @Override
        PostingsFormat getPostingsFormatForField(String field) {
          if (field.equals("id")) {
            return memory;
          } else {
            super.getPostingsFormatForField(field);
          }
        }
        
        Show
        Robert Muir added a comment - I don't think that Facet42Codec should be final this way it can still be tweaked in the normal ways someone would tweak the default codec, overriding getPostings/DocValuesFormatForField, e.g.: @Override PostingsFormat getPostingsFormatForField( String field) { if (field.equals( "id" )) { return memory; } else { super .getPostingsFormatForField(field); } }
        Hide
        Shai Erera added a comment -

        You're right. I was just thinking ... until we need to do that, it can be final. But it can also be not final ...

        I'm still not sure what's the best (or common) way to work w/ Codecs. When do you create your own FilterCodec and override the relevant method, and when are you expected to extend another Codec?

        I guess that in this case, since getDVFForField is only on Lucene42Codec, you cannot really extend FilterCodec, so extending Lucene42 / Facet42 is the only option?

        In short, I don't mind if it's not final .

        Show
        Shai Erera added a comment - You're right. I was just thinking ... until we need to do that, it can be final. But it can also be not final ... I'm still not sure what's the best (or common) way to work w/ Codecs. When do you create your own FilterCodec and override the relevant method, and when are you expected to extend another Codec? I guess that in this case, since getDVFForField is only on Lucene42Codec, you cannot really extend FilterCodec, so extending Lucene42 / Facet42 is the only option? In short, I don't mind if it's not final .
        Hide
        Robert Muir added a comment -

        I'm still not sure what's the best (or common) way to work w/ Codecs. When do you create your own FilterCodec and override the relevant method, and when are you expected to extend another Codec?

        The rule is that an index needs to be self-describing: basically I should be able to open any index if i have the right stuff in my classpath.

        This would be somewhat of a burden for users who just want to change their "id" field to use a different postings format (they would have to make a whole codec with their own unique name), and so on.

        So Lucene42Codec uses PerField[Postings/DocValuesFormat] (note this is final!), which separately record the name of the format used on a per-field basis.

        Because of this, its ok that it not final and exposes these hooks to custom postings/docvalues per-field, because it writes the name of those formats into the index for the field.

        Show
        Robert Muir added a comment - I'm still not sure what's the best (or common) way to work w/ Codecs. When do you create your own FilterCodec and override the relevant method, and when are you expected to extend another Codec? The rule is that an index needs to be self-describing: basically I should be able to open any index if i have the right stuff in my classpath. This would be somewhat of a burden for users who just want to change their "id" field to use a different postings format (they would have to make a whole codec with their own unique name), and so on. So Lucene42Codec uses PerField [Postings/DocValuesFormat] (note this is final!), which separately record the name of the format used on a per-field basis. Because of this, its ok that it not final and exposes these hooks to custom postings/docvalues per-field, because it writes the name of those formats into the index for the field.
        Hide
        Shai Erera added a comment -

        I see, thanks for the clarification. So basically, if I'm happy w/ Lucene42Codec, but just was to override a per-field setting (DV or posting), it's simpler that I extend it. Would I also be able to extend FilterCodec, and delegate to Lucene42 the fields that are of not interest to me? Would it still use the PerField thingy for my fields too? Or in that case I'd need to use PerField myself?

        Show
        Shai Erera added a comment - I see, thanks for the clarification. So basically, if I'm happy w/ Lucene42Codec, but just was to override a per-field setting (DV or posting), it's simpler that I extend it. Would I also be able to extend FilterCodec, and delegate to Lucene42 the fields that are of not interest to me? Would it still use the PerField thingy for my fields too? Or in that case I'd need to use PerField myself?
        Hide
        Michael McCandless added a comment -

        New patch, making acceptableOverheadRatio controllable (defaulting to
        PackedInts.DEFAULT), noting the 2GB limitation in the javadocs, making Facet42Codec
        un-final, and added CHANGES.txt entry. I think it's ready!

        Show
        Michael McCandless added a comment - New patch, making acceptableOverheadRatio controllable (defaulting to PackedInts.DEFAULT), noting the 2GB limitation in the javadocs, making Facet42Codec un-final, and added CHANGES.txt entry. I think it's ready!
        Hide
        Shai Erera added a comment -

        Looks good. +1 to commit.

        Show
        Shai Erera added a comment - Looks good. +1 to commit.
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1445278

        LUCENE-4764: add more-ram-consuming-but-faster DV format for facets

        Show
        Commit Tag Bot added a comment - [trunk commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1445278 LUCENE-4764 : add more-ram-consuming-but-faster DV format for facets
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1445280

        LUCENE-4764: add more-ram-consuming-but-faster DV format for facets

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1445280 LUCENE-4764 : add more-ram-consuming-but-faster DV format for facets
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development