Lucene - Core
LUCENE-4769

Add a CountingFacetsAggregator which reads ordinals from a cache

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2, 6.0
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Mike wrote a prototype of a FacetsCollector which reads ordinals from a CachedInts structure on LUCENE-4609. I ported it to the new facets API, as a FacetsAggregator. I think we should offer users the means to use such a cache, even if it consumes more RAM. Mike's tests show that this cache consumed 2x more RAM than if the DocValues were loaded into memory in their raw form. Also, a PackedInts version of the cache took almost the same amount of RAM as a straight int[], but the gains were minor.

      I will post the patch shortly.
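To illustrate the trade-off the description talks about, here is a standalone toy sketch (simplified code, not Lucene's actual CachedInts or DocValues classes): the facet ordinals are stored as delta-gap + VInt encoded bytes, and the cache decodes them once into an int[] so per-query aggregation skips the decode step, at the cost of extra RAM.

```java
import java.util.Arrays;

/**
 * Toy illustration of caching decoded facet ordinals. Not Lucene's real
 * classes; the class and method names here are made up for the sketch.
 */
public class OrdinalCacheSketch {

  /** Encode sorted ordinals as delta-gaps, each gap as a VInt. */
  static byte[] encode(int[] ords) {
    byte[] buf = new byte[ords.length * 5];  // worst case: 5 bytes per VInt
    int upto = 0, prev = 0;
    for (int ord : ords) {
      int v = ord - prev;  // delta-gap
      prev = ord;
      while ((v & ~0x7F) != 0) {
        buf[upto++] = (byte) ((v & 0x7F) | 0x80);
        v >>>= 7;
      }
      buf[upto++] = (byte) v;
    }
    return Arrays.copyOf(buf, upto);
  }

  /** Decode once into an int[] "cache"; aggregation then reads ints directly. */
  static int[] decode(byte[] bytes, int count) {
    int[] ords = new int[count];
    int pos = 0, prev = 0;
    for (int i = 0; i < count; i++) {
      int v = 0, shift = 0;
      byte b;
      do {
        b = bytes[pos++];
        v |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      prev += v;  // undo the delta-gap
      ords[i] = prev;
    }
    return ords;
  }
}
```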

      1. LUCENE-4769.patch
        17 kB
        Shai Erera
      2. LUCENE-4769.patch
        13 kB
        Shai Erera
      3. LUCENE-4769.patch
        7 kB
        Shai Erera

        Activity

        Shai Erera added a comment -

        Patch adds CachedIntsCountingFacetsAggregator. I modified the CountingFacetsAggregator test to randomly pick it, and it passed for many iterations (still, it does not generate many documents).

        Mike, can you give it a try w/ luceneutil? Measure the time differences as well as the RAM footprint?

        If it works and we decide to keep it (not as the default though!), then I want to create an AbstractCountingFacetsAggregator which will implement rollupValues and requiresScore, which are now duplicated among the 3 counting aggregators. The other aggregators should then just implement aggregate.
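The proposed refactoring could look roughly like this standalone sketch (simplified types; the real patch's signatures differ): the shared rollupValues/requiresScore logic lives in an abstract base, and each counting variant implements only aggregate.

```java
/**
 * Sketch of an abstract counting-aggregator base. Class names and
 * signatures are illustrative, not the actual Lucene API.
 */
abstract class CountingAggregatorBase {
  /** counts[ord] accumulated per category ordinal. */
  final int[] counts;

  CountingAggregatorBase(int numOrds) { this.counts = new int[numOrds]; }

  /** Only this differs between the counting variants. */
  abstract void aggregate(int[] docOrdinals);

  /** Counting never needs document scores. */
  final boolean requiresScore() { return false; }

  /** Roll up the children's counts into their parent's count. */
  final void rollupValues(int parentOrd, int[] childOrds) {
    for (int child : childOrds) {
      counts[parentOrd] += counts[child];
    }
  }
}

/** One concrete variant: plain per-ordinal counting. */
class SimpleCountingAggregator extends CountingAggregatorBase {
  SimpleCountingAggregator(int numOrds) { super(numOrds); }

  @Override
  void aggregate(int[] docOrdinals) {
    for (int ord : docOrdinals) counts[ord]++;
  }
}
```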

        Michael McCandless added a comment -

        Full (6.6M) wikibig index, 7 facet dims:

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                         Respell       46.60      (3.4%)       45.82      (4.1%)   -1.7% (  -8% -    6%)
                    HighSpanNear        3.49      (1.7%)        3.51      (2.2%)    0.8% (  -3% -    4%)
                      HighPhrase       17.13     (10.5%)       17.42     (11.0%)    1.7% ( -17% -   26%)
                          Fuzzy2       53.25      (2.8%)       54.19      (3.1%)    1.8% (  -4% -    7%)
                      AndHighLow      587.43      (2.3%)      597.84      (2.6%)    1.8% (  -3% -    6%)
                 LowSloppyPhrase       20.30      (2.3%)       20.68      (2.3%)    1.9% (  -2% -    6%)
                     LowSpanNear        8.24      (2.3%)        8.42      (2.9%)    2.1% (  -3% -    7%)
                     AndHighHigh       23.36      (1.3%)       23.95      (0.9%)    2.5% (   0% -    4%)
                HighSloppyPhrase        0.92      (5.1%)        0.94      (6.1%)    2.8% (  -7% -   14%)
                       LowPhrase       21.02      (6.2%)       21.63      (6.7%)    2.9% (  -9% -   16%)
                     MedSpanNear       28.31      (1.3%)       29.20      (1.5%)    3.1% (   0% -    6%)
                 MedSloppyPhrase       25.98      (1.7%)       26.79      (1.7%)    3.1% (   0% -    6%)
                         MedTerm       47.54      (1.9%)       49.49      (3.4%)    4.1% (  -1% -    9%)
                          Fuzzy1       47.28      (2.2%)       49.27      (2.6%)    4.2% (   0% -    9%)
                      AndHighMed      105.55      (0.9%)      112.03      (1.2%)    6.1% (   3% -    8%)
                        Wildcard       27.63      (1.2%)       30.03      (1.6%)    8.7% (   5% -   11%)
                       MedPhrase      109.43      (5.6%)      122.45      (7.4%)   11.9% (   0% -   26%)
                         LowTerm      110.94      (1.9%)      128.73      (1.8%)   16.0% (  12% -   20%)
                       OrHighLow       17.11      (2.2%)       22.44      (3.7%)   31.1% (  24% -   37%)
                       OrHighMed       16.63      (2.1%)       21.89      (3.8%)   31.6% (  25% -   38%)
                        HighTerm       19.17      (1.9%)       26.30      (3.5%)   37.2% (  31% -   43%)
                      OrHighHigh        8.77      (2.4%)       12.45      (4.7%)   42.1% (  34% -   50%)
                         Prefix3       13.06      (1.8%)       18.66      (2.2%)   42.9% (  38% -   47%)
                          IntNRQ        3.59      (1.6%)        6.45      (3.3%)   79.8% (  73% -   86%)
        

        trunk DVs take 61.4 MB while the int[] cache takes 202.9 MB (3.3X
        more). Also, if users use the int[] cache they must remember to use
        (and maybe we check / warn about it somehow) a disk-backed DV else
        it's silly since you'd be double-caching in RAM.

        Curiously these gains are not that much better (except IntNRQ) than
        LUCENE-4764, which was only ~31% larger... which is odd because we had
        previously tested [close to] LUCENE-4764 against int[] cache and it
        was faster.

        Shai Erera added a comment -

        Previously, on LUCENE-4609, we tested on the full 2.2M ordinals, which produced larger DVs. So I think in order to compare to those results, we need to repeat that test on the full set of ordinals. And separately, it would be good to tell whether LUCENE-4764 is also faster (and if so, by how much) on the full set of ordinals.

        Then, we can compare fast-DV to int[].

        Michael McCandless added a comment -

        Full (6.6M docs) wikibig, 9 dims. Base is trunk, comp is int[] cache:

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                         Respell       44.87      (3.2%)       44.32      (3.8%)   -1.2% (  -7% -    5%)
                      AndHighLow       64.10      (1.8%)       63.33      (1.1%)   -1.2% (  -4% -    1%)
                     LowSpanNear        7.10      (2.1%)        7.18      (2.0%)    1.1% (  -2% -    5%)
                          Fuzzy2       28.55      (2.0%)       29.06      (1.6%)    1.8% (  -1% -    5%)
                      HighPhrase       13.69      (8.7%)       13.98      (8.3%)    2.1% ( -13% -   20%)
                 LowSloppyPhrase       15.40      (2.6%)       15.85      (1.6%)    3.0% (  -1% -    7%)
                      AndHighMed       39.48      (1.0%)       41.07      (0.8%)    4.0% (   2% -    5%)
                    HighSpanNear        2.91      (1.3%)        3.03      (1.1%)    4.2% (   1% -    6%)
                       LowPhrase       15.01      (4.4%)       15.80      (4.7%)    5.2% (  -3% -   14%)
                     MedSpanNear       17.68      (1.4%)       18.87      (1.2%)    6.7% (   3% -    9%)
                 MedSloppyPhrase       16.56      (1.3%)       17.82      (1.3%)    7.6% (   5% -   10%)
                       MedPhrase       41.08      (2.5%)       44.27      (2.8%)    7.8% (   2% -   13%)
                          Fuzzy1       24.08      (1.5%)       25.97      (1.9%)    7.9% (   4% -   11%)
                HighSloppyPhrase        0.82      (6.2%)        0.89      (5.6%)    9.1% (  -2% -   22%)
                         LowTerm       34.22      (1.1%)       39.13      (1.3%)   14.3% (  11% -   16%)
                     AndHighHigh       11.87      (1.3%)       13.96      (1.2%)   17.6% (  14% -   20%)
                        Wildcard       13.02      (1.9%)       16.02      (1.5%)   23.1% (  19% -   27%)
                         MedTerm       20.04      (2.2%)       24.86      (2.4%)   24.1% (  19% -   29%)
                       OrHighMed        7.02      (2.7%)        9.85      (2.3%)   40.3% (  34% -   46%)
                       OrHighLow        7.08      (2.8%)        9.95      (2.8%)   40.5% (  33% -   47%)
                        HighTerm        7.52      (2.5%)       10.78      (2.1%)   43.4% (  37% -   49%)
                         Prefix3        5.71      (2.4%)        8.51      (1.5%)   48.9% (  43% -   54%)
                      OrHighHigh        3.99      (2.3%)        6.26      (2.6%)   56.8% (  50% -   63%)
                          IntNRQ        1.91      (2.3%)        3.66      (3.0%)   91.9% (  84% -   99%)
        

        Cache is 291.5 MB, and trunk is 129.0 MB = 2.26X larger.

        Shai Erera added a comment -

        Nice results. 2.26x more RAM on such large DVs is something you can live with. And the gains are better than fast-DV, I guess because when you have that many ords, the VInt+Dgap decoding time is significant, and stands out more.

        So I think we should commit this Aggregator?

        Robert Muir added a comment -

        I don't know what an aggregator is, but this could also be done as a codec (in similar fashion to LUCENE-4764)?

        Isn't it basically a DirectFacetingDocValuesFormat, which would just hold its data as int[]s, and the collector could get at it like in LUCENE-4764?

        Shai Erera added a comment -

        FacetsAggregator is an abstraction of the facets package that lets you compute different functions on the aggregated ordinals. E.g. counting is equivalent to #sum(1), while SumScoreFacetsAggregator does #sum(score) etc.
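The aggregation-function view described here can be illustrated with a standalone sketch (simplified, not the actual Lucene API): counting is sum(1) over each matching document's ordinals, while a sum-score aggregator is sum(score).

```java
/**
 * Illustrative aggregator interface; names and signatures are made up
 * for this sketch, not copied from the facets module.
 */
interface OrdinalAggregator {
  /** Accumulate into values[ord] for each ordinal of one matching doc. */
  void aggregate(int[] docOrdinals, float score, float[] values);
}

/** Counting: equivalent to sum(1). */
class CountAggregator implements OrdinalAggregator {
  public void aggregate(int[] docOrdinals, float score, float[] values) {
    for (int ord : docOrdinals) values[ord] += 1f;
  }
}

/** Sum-score: equivalent to sum(score). */
class SumScoreAggregator implements OrdinalAggregator {
  public void aggregate(int[] docOrdinals, float score, float[] values) {
    for (int ord : docOrdinals) values[ord] += score;
  }
}
```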

        You're right that this could be implemented as a Codec, and then we wouldn't even need to alert the user that if he uses that caching method, he should use DiskValuesFormat. But it looks like an awkward decision to me. Usually, caching does not force you to index stuff in a specific way. Rather, you decide at runtime if you want to cache the data or not. You can even choose to stop using the cache while the app is running. Also, it's odd that if the app already indexed documents with the default Codec, it won't be able to use this caching method unless it reindexes, or until those segments are merged (b/c their DVFormat will be different, and so the aggregator would need to revert to a different counting code).

        I dunno ... it's certainly doable, but it doesn't feel right to me.

        Robert Muir added a comment -

        I don't look at it as a cache, but as a different codec representation that holds the stuff in int[]s.

        Just like DirectPostingsFormat.

        Shai Erera added a comment -

        It's not like DirectPostingsFormat though. DPF hides the int[] from you, and you interact with the general API, not knowing that under the covers it does things more efficiently. I think that on LUCENE-4764, if we can prove that this specialization doesn't help much (i.e. you don't need to cast to FacetsDV and pull the addresses and bytes), then it'd be compelling. And if we had a DV type that had .get(doc, IntsRef), then an int[] DVFormat would also make sense.

        But if we implement that as a Codec, then the app would need to set both the Codec and the matching FacetsAggregator. Also, it will be ineffective to use this Codec on existing large indexes, as you won't gain anything. I treat this like FieldCache .. you have something indexed one way, and read another way. Again, if there was a DVFormat that would let me ask for all integers of a document, it'd be a different story I think.

        Robert Muir added a comment -

        I don't think we should add a docvalues format for that though, because what lucene/facets is doing isn't a natural search use case.

        It's (ab)using DV as a way to store custom per-doc pointers to its own separate data structures: I think the approach Mike is taking on LUCENE-4764 is right; this encoding and so on should be private to facets.

        You can extend BinaryDocValues (FacetDocValues) and add your own int[] API on top and instanceof: you don't need IndexWriter assistance for that.

        Shai Erera added a comment -

        I didn't propose that we add a DV format, I was saying that if there was one, then a DirectFacets format would make sense, b/c the app wouldn't need to write special code to work with it ... it would just return the ints more efficiently.

        And we're abusing DV now, just like we abused payloads before, so nothing has changed.

        I did propose on another issue (forgot where, maybe the migration layer issue?) to develop a FacetsCodec, but you were against it. Perhaps after you worked on DV 2.0 you now think it makes more sense? It will solve a slew of problems I think.

        This FacetsCodec today is mimicked by CategoryListIterator, which exposes that getInts API. But Mike and I saw that the DV abstraction (getBytes) + CLI (getInts) hurts performance, therefore the *fast* aggregators / collectors sidestep the CLI abstraction and use only DV. On LUCENE-4764, Mike sidesteps the DV abstraction too, which results in more duplicated code. I'm all for those specializations, but it becomes harder to maintain. I just think of all the places we'd need to change if someone finds a better encoding than gap+vint.

        Plus, the specialization doesn't serve the different facet features. I.e. if I'm interested in fast sum-score, I need to write a specialized one. If I'm interested in fast sum-association, I need to write one. Just to be clear, I'm not complaining and I think it makes sense for expert apps to write some specialized code. What I am saying is that if we could make the abstractions FAST, then we'd lower the bar of when apps would need to do that ...

        So far, our latest optimizations only pertain to the counting case. It is the common case and I think it's important that we did that. Perhaps the rest of the API changes also improved the other cases too, but it's clear that if we want to really speed them up, we should specialize them.

        Maybe if we had a FacetsCodec, with a CategoryListFormat (an extension to Codec, private to facets), then LUCENE-4764 and this issue would benefit all facet features out-of-the-box. Because that format will expose what facets need - a getInts API. And if we make this one a Codec and FastDV a Codec, then we anyway force the app to declare a special facets Codec, so at least from that aspect, we won't require more ...

        And if we do a FacetsCodec w/ CategoryListFormat, then at first it can continue to abuse DV, but then in the future we can explore a different format to manage the per-document categories (and support category associations). Maybe even a way to manage the taxonomy in the main index, in its own data structure ...

        Perhaps these two issues show the usefulness of having such Codec?

        Robert Muir added a comment -

        I still don't get why you need anything special here, more than what Lucene already provides.

        You seem hell-bent on the idea that lucene should have a getInts(docid, IntsRef) API for facets.

        What I'm trying to say is, if you really think that's the API you should have, nothing in lucene stops you from doing this, just add it:

        package org.apache.lucene.facet.index;
        
        abstract class FacetDocValues extends BinaryDocValues {
          abstract void getInts(int docid, IntsRef ref);
        }
        

        What is the problem? You can then make a jazillion implementations like LUCENE-4764 extend that API and have one collector that uses it.
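The instanceof pattern being suggested could look like this standalone sketch (BinaryDocValues is mocked here with a made-up stand-in class, so the snippet compiles on its own): one collector checks for the specialized subclass and falls back to the generic byte[] access otherwise.

```java
/** Made-up stand-in for BinaryDocValues, so the sketch is self-contained. */
abstract class BytesPerDoc {
  abstract byte[] get(int docId);  // generic per-document byte[] access
}

/** The proposed facet-specific extension with a direct int[] API. */
abstract class FacetDocValues extends BytesPerDoc {
  abstract int[] getInts(int docId);
}

class FacetCollectorSketch {
  /** Count one doc's ordinals, preferring the specialized API when present. */
  static int countOrds(BytesPerDoc dv, int docId) {
    if (dv instanceof FacetDocValues) {
      return ((FacetDocValues) dv).getInts(docId).length;  // fast path
    }
    // Fallback: in this toy sketch, pretend each byte encodes one ordinal.
    return dv.get(docId).length;
  }
}
```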

        Shai Erera added a comment -

        Ok .. I think I know where the confusion is, and it's mostly due to my lack of proper understanding of Codecs ..

        We basically mean the same thing, only what you propose is more realistic w/ today's IndexReader API, which only exposes docValues. While what I had in mind (taking a look again at notes I wrote a few months ago) is that facets could have a CompositeReader impl which adds facets-specific API. Until then, we have no other choice but to piggy-back on the DV API, and that means extending DVFormat. Thanks for insisting, it made me understand how this should work ... (sorry, but I haven't written a Codec yet).

        Perhaps separately we can think about an IndexReader impl for facets, which will open the road to many different optimizations, e.g. maintaining a per-segment taxonomy and top-level reader global-ordinal map (all in-memory), encoding facet ordinals in their own structure (and not DV) and maybe even managing the global taxonomy as part of the search index (through sidecar files or something), w/o the sidecar index, which I think today is a barrier for apps as well as integrating that into Solr or ES. But that should be done separately as it's a major refactoring to how facets work.

        Even FacetsDV are sort of a refactoring (i.e. replacing CategoryListIterator with that .. if we want to do it right), so I think that for now I'm going to still commit that cache as an aggregator and we can get rid of it once we do FacetsDV.

        Oh .. and there was one thing that bothered me in that statement:

        You seem hell-bent on the idea that lucene should have a getInts(docid, IntsRef) api for facets

        First, I'm not hell-bent on anything (don't even know what that means). Second, facets are now a *lucene* module, and not private to me. From my perspective, lucene doesn't need to have anything for me, but lucene should have the best facets module. So far I've been busy refactoring facets so they work faster and have cleaner API ... not to me, to lucene users. I'm sure things can be simplified even further and improved even more. I think about it constantly. If you have a better idea of how facets should work (while maintaining current capabilities, as much as possible), I'm all open to suggestions, really.

        Shai Erera added a comment -

        Added IntRollupFacetsAggregator which is an abstract FacetsAggregator that implements rollupValues by summing the values from FacetArrays.getIntArray(). Moved the 3 counting aggregators (including the one in this patch) to extend it and focus on implementing aggregate only.
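The rollup described here can be sketched standalone (simplified taxonomy arrays, hypothetical names; the real implementation reads the counts from FacetArrays.getIntArray() and walks the taxonomy's parallel arrays): children[ord] points at the first child, siblings[ord] at the next sibling, and -1 terminates a chain.

```java
/**
 * Toy sketch of rolling up int values through a taxonomy tree,
 * summing each ordinal's subtree into it. Names are illustrative.
 */
class IntRollupSketch {
  /**
   * Recursively roll up the subtree under ord: after this returns,
   * values[ord] includes the values of all its descendants.
   */
  static void rollup(int ord, int[] children, int[] siblings, int[] values) {
    for (int child = children[ord]; child != -1; child = siblings[child]) {
      rollup(child, children, siblings, values);  // grandchildren first
      values[ord] += values[child];
    }
  }
}
```

For example, with root 0 having children 1 and 2, and 1 having child 3, the root ends up with the sum of the whole tree.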

        Robert Muir added a comment -

        Until then, we have no other choice but to piggy-back on DV API, and that means extending DVFormat.

        Well mainly I'm trying to make sure we only have the minimum DocValues types and APIs we actually need. Additional types are very costly to us.

        I'm still unsure myself that lucene should have a byte[] docvalues type that is unsorted: I don't see any real use cases for it directly.

        But for someone who wants to encode their own data structures, having a per-document byte[] type where your codec can see all the values is pretty powerful. So if having this "catch-all" type prevents additional types from being added to lucene, then maybe it's worth it.

        Perhaps separately we can think about an IndexReader impl for facets, which will open the road to many different optimizations, e.g. maintaining a per-segment taxonomy and top-level reader global-ordinal map (all in-memory), encoding facet ordinals in their own structure (and not DV) and maybe even managing the global taxonomy as part of the search index (through sidecar files or something), w/o the sidecar index, which I think today is a barrier for apps as well as integrating that into Solr or ES. But that should be done separately as it's a major refactoring to how facets work.

        I think a custom IndexReader impl would lower the barriers to integration with those systems too, just in a different way. Personally I think the current design (sidecar) is the most performant. But we should consider adding other possibilities to lucene that make different tradeoffs, e.g. working without it.

        First, I'm not hell-bent on anything (don't even know what that means). Second, facets are now a lucene module, and not private to me. From my perspective, lucene doesn't need to have anything for me, but lucene should have the best facets module. So far I've been busy refactoring facets so they work faster and have cleaner API ... not to me, to lucene users. I'm sure things can be simplified even further and improved even more. I think about it constantly. If you have a better idea of how facets should work (while maintaining current capabilities, as much as possible), I'm all open to suggestions, really.

        I know, you are doing a great job. I'm just explaining my opinion on this situation: having facets "build on top of" BinaryDocValues doesn't hurt it in the slightest. Sometimes I wonder if you are having this argument with me to avoid a single type cast in the facets codebase, or for some other cosmetic reason.
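For concreteness, the per-document byte[] that facets piggy-back on holds the doc's category ordinals, sorted and packed as delta-gaps in variable-length ints. A standalone sketch of that dgap+vint idea, assuming the usual 7-bits-per-byte VInt layout (this mirrors the scheme the thread mentions, but is not Lucene's actual encoder code):

```java
import java.io.ByteArrayOutputStream;

// Standalone sketch of dgap+vint: sorted ordinals are stored as deltas
// between neighbors, each delta written low 7 bits first, with the high
// bit of a byte meaning "more bytes follow".
final class DGapVInt {

    static byte[] encode(int[] sortedOrds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int ord : sortedOrds) {
            int delta = ord - prev;        // gaps are small, so VInts stay short
            prev = ord;
            while ((delta & ~0x7F) != 0) { // more than 7 bits left: continue
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);              // final byte has high bit clear
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes, int numOrds) {
        int[] ords = new int[numOrds];
        int pos = 0, prev = 0;
        for (int i = 0; i < numOrds; i++) {
            int delta = 0, shift = 0, b;
            do {
                b = bytes[pos++] & 0xFF;
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;                 // undo the delta-gap
            ords[i] = prev;
        }
        return ords;
    }
}
```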

        Shai Erera added a comment -

        Sometimes I wonder if you are having this argument with me to avoid a single type cast in the facets codebase or for some other cosmetic reason

        Not at all. I'm looking for the cleanest solution. I'm just much less familiar than you are when it comes to Codecs and Formats, and therefore I fail to think of ways to piggy-back on (abuse?) them for facets.

        I certainly think your FacetsBDV idea is good. I don't mind casting if it will help the API. Basically .getInts() is the API of CategoryListIterator today ... maybe we can nuke CLI in favor of FacetsBDV? Definitely worth looking at. Let's explore that in a separate issue though. So according to the results on LUCENE-4764, we don't need any specialization to get the bytes of a certain document. And in the future, with FacetsBDV, we won't need CachedInts as an aggregator but as another Codec, and again, we won't need to specialize (I hope!).

        And if we ever develop a custom IR for facets, we can add the .getInts API higher-up the chain, and not necessarily depend on DocValues.
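The getInts-style API the discussion keeps returning to could be pictured as follows; the interface name and the array-backed implementation are hypothetical, introduced here only to illustrate the shape such an API might take:

```java
// Hypothetical shape of a per-document ordinals API: callers receive
// decoded ints rather than raw DocValues bytes. Not an actual Lucene
// interface; names are illustrative only.
interface FacetOrdinalsSource {
    /** Copies docID's category ordinals into buffer and returns how many. */
    int getInts(int docID, int[] buffer);
}

// Trivial in-memory implementation backed by pre-decoded per-doc arrays,
// e.g. what a cache layer could hand out.
final class ArrayOrdinalsSource implements FacetOrdinalsSource {
    private final int[][] ords;

    ArrayOrdinalsSource(int[][] ords) {
        this.ords = ords;
    }

    public int getInts(int docID, int[] buffer) {
        int[] docOrds = ords[docID];
        System.arraycopy(docOrds, 0, buffer, 0, docOrds.length);
        return docOrds.length;
    }
}
```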

        Shai Erera added a comment -

        I wonder, since CachedInts for a segment is created only once, if we should make it more generic, and work w/ IntDecoder instead of hard-coding to dgap+vint? Then, we can make this class top-level, e.g. as OrdinalsCache with a static getCachedInts/Ords for a segment? Otherwise, I'll add an assert to CachedIntsCountingFacetsAggregator to ensure the decoder used is DGapVInt.
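The generic cache described here could be sketched as follows. Class and method names follow the comment (OrdinalsCache, IntDecoder), but the signatures and internals are assumptions, not the committed code; the idea is a per-segment map from cache key to ordinals decoded exactly once:

```java
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch of the proposed OrdinalsCache: decode each segment's category
// ordinals once, keyed by the segment's cache key, so repeated searches
// over the same segment skip re-decoding. The decoder is pluggable rather
// than hard-coded to dgap+vint.
final class OrdinalsCache {

    interface IntDecoder {
        int[] decode(byte[] payload);
    }

    /** Decoded ordinals: doc d owns ords[offsets[d] .. offsets[d+1]-1]. */
    static final class CachedOrds {
        final int[] offsets, ords;
        CachedOrds(int[] offsets, int[] ords) {
            this.offsets = offsets;
            this.ords = ords;
        }
    }

    // Weak keys let entries disappear when a segment's key is collected.
    private static final Map<Object, CachedOrds> CACHE =
        Collections.synchronizedMap(new WeakHashMap<>());

    /** Returns the segment's decoded ordinals, decoding on first access only. */
    static CachedOrds getCachedOrds(Object segmentKey, byte[][] perDocPayloads,
                                    IntDecoder decoder) {
        CachedOrds c = CACHE.get(segmentKey);
        if (c != null) return c;                 // cache hit: no decode
        int total = 0;
        int[][] decoded = new int[perDocPayloads.length][];
        for (int d = 0; d < decoded.length; d++) {
            decoded[d] = decoder.decode(perDocPayloads[d]);
            total += decoded[d].length;
        }
        int[] offsets = new int[decoded.length + 1];
        int[] ords = new int[total];
        int upto = 0;
        for (int d = 0; d < decoded.length; d++) {
            offsets[d] = upto;
            for (int ord : decoded[d]) ords[upto++] = ord;
        }
        offsets[decoded.length] = upto;
        c = new CachedOrds(offsets, ords);
        CACHE.put(segmentKey, c);
        return c;
    }
}
```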

        Shai Erera added a comment -

        Patch adds OrdinalsCache to handle the caching and CachedOrdsCountingFacetsAggregator uses it. OrdinalsCache uses IntDecoder to decode the values from the category list.

        I think this is ready to commit.
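With ordinals cached per segment, counting reduces to a flat array scan per matching doc. A hedged sketch of that aggregation loop (illustrative names, not the committed CachedOrdsCountingFacetsAggregator):

```java
// Counting loop a cached-ordinals aggregator runs: for each matching doc,
// bump counts[ord] for every ordinal the doc carries. offsets/ords are the
// flat per-segment arrays an ordinals cache would hold, where doc d owns
// ords[offsets[d] .. offsets[d+1]-1].
final class CachedOrdsCounter {

    static void count(int[] matchingDocs, int[] offsets, int[] ords, int[] counts) {
        for (int doc : matchingDocs) {
            for (int i = offsets[doc]; i < offsets[doc + 1]; i++) {
                counts[ords[i]]++;   // one increment per (doc, ordinal) pair
            }
        }
    }
}
```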

        Commit Tag Bot added a comment -

        [trunk commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1445234

        LUCENE-4769: Add a CountingFacetsAggregator which reads ordinals from a cache

        Shai Erera added a comment -

        Committed to trunk and 4x

        Commit Tag Bot added a comment -

        [branch_4x commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1445251

        LUCENE-4769: Add a CountingFacetsAggregator which reads ordinals from a cache

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee:
            Shai Erera
            Reporter:
            Shai Erera
          • Votes:
            0
            Watchers:
            1
