Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.2, 6.0
    • Component/s: modules/facet
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      With the move of OrdinalPolicy to CategoryListParams, NonTopLevelOrdinalPolicy was nuked. It might be good to restore it, as another enum value of OrdinalPolicy.

      It's the same like ALL_PARENTS, only doesn't add the dimension ordinal, which could save space as well as computation time. It's good for when you don't care about the count of Date/, but only about its children counts.

      1. LUCENE-4715.patch
        57 kB
        Shai Erera
      2. LUCENE-4715.patch
        54 kB
        Shai Erera

        Activity

        Hide
        Michael McCandless added a comment -

        +1

        This is very compelling because you also do not need to do the rollup in the end, so you get the gains of NO_PARENTS without the cost (end rollup), so even for indices w/ many unique ords there's no penalty.

        I also suspect this is common ... ie most faceted search UIs I've seen do not include the dim facet count, just the values under that dim.

        Show
        Michael McCandless added a comment - +1 This is very compelling because you also do not need to do the rollup in the end, so you get the gains of NO_PARENTS without the cost (end rollup), so even for indices w/ many unique ords there's no penalty. I also suspect this is common ... ie most faceted search UIs I've seen do not include the dim facet count, just the values under that dim.
        Hide
        Shai Erera added a comment -

        so you get the gains of NO_PARENTS without the cost (end rollup)

        I'd like to clarify the difference between the 3 OrdinalPolicies, through example. Say that you add the facet Date/2010/March/11, then the ordinals that will be encoded are of:

        • ALL_PARENTS: Date, Date/2010, Date/2010/March, Date/2010/March/11 (4 ordinals)
        • NO_PARENTS: Date/2010/March/11 (1 ordinal)
        • ALL_BUT_DIMENSION: Date/2010, Date/2010/March, Date/2010/March/11 (3 ordinals)

        The aggregation of ALL_PARENTS and ALL_BUT_DIM is the same, but you don't get the count of Date/ with the latter policy.
        The aggregation of NO_PARENTS is the same as ALL_PARENTS, only you need to do the rollup in the end.

        Maybe we should have just ALL and ALL_BUT_DIM? Given the results of NO_PARENTS (not that significant gains), and the complexity involved (i.e. currently only CountingFC supports it) ...

        But I just wanted to clarify the differences, because the ALL_BUT_DIM is not that much different than ALL, except that you'll encode and decode the same ordinal for many documents, so that has got to buy us something ...

        Show
        Shai Erera added a comment - so you get the gains of NO_PARENTS without the cost (end rollup) I'd like to clarify the difference between the 3 OrdinalPolicies, through example. Say that you add the facet Date/2010/March/11, then the ordinals that will be encoded are of: ALL_PARENTS: Date, Date/2010, Date/2010/March, Date/2010/March/11 (4 ordinals) NO_PARENTS: Date/2010/March/11 (1 ordinal) ALL_BUT_DIMENSION: Date/2010, Date/2010/March, Date/2010/March/11 (3 ordinals) The aggregation of ALL_PARENTS and ALL_BUT_DIM is the same, but you don't get the count of Date/ with the latter policy. The aggregation of NO_PARENTS is the same as ALL_PARENTS, only you need to do the rollup in the end. Maybe we should have just ALL and ALL_BUT_DIM? Given the results of NO_PARENTS (not that significant gains), and the complexity involved (i.e. currently only CountingFC supports it) ... But I just wanted to clarify the differences, because the ALL_BUT_DIM is not that much different than ALL, except that you'll encode and decode the same ordinal for many documents, so that has got to buy us something ...
        Hide
        Shai Erera added a comment -
        • Patch adds OrdinalPolicy.NO_DIMENSION and modifies CountingListBuilder to discard the last added ordinal (the dimension) in that case.
        • CountingFacetsCollector modified to rollup parents' counts when dimensions are indexed with NO_PARENTS.
        • CategoryListParams.getOrdinals modified to take a String (dimension). The default impl returns DEFAULT_ORDINAL_POLICY and PerDimensionOrdinalPolicy supports returning different policies for different dimensions.
        • OrdinalPolicy.NO_DIMENSION made the default, as usually the count of dimension is not needed by apps.
          • Note though that when children of the dimension are specified as the root category of the FacetRequest, they will be counted too.
          • For all practical purposes, NO_DIMENSION and ALL_PARENTS behave the same by that they count whatever is encoded in the counting list, and don't fix parents' counts in the end (nor check if a certain ordinal needs to be counted or not).

        All tests pass, I think this is ready.

        Show
        Shai Erera added a comment - Patch adds OrdinalPolicy.NO_DIMENSION and modifies CountingListBuilder to discard the last added ordinal (the dimension) in that case. CountingFacetsCollector modified to rollup parents' counts when dimensions are indexed with NO_PARENTS. CategoryListParams.getOrdinals modified to take a String (dimension). The default impl returns DEFAULT_ORDINAL_POLICY and PerDimensionOrdinalPolicy supports returning different policies for different dimensions. OrdinalPolicy.NO_DIMENSION made the default, as usually the count of dimension is not needed by apps. Note though that when children of the dimension are specified as the root category of the FacetRequest , they will be counted too. For all practical purposes, NO_DIMENSION and ALL_PARENTS behave the same by that they count whatever is encoded in the counting list, and don't fix parents' counts in the end (nor check if a certain ordinal needs to be counted or not). All tests pass, I think this is ready.
        Hide
        Gilad Barkai added a comment -

        Looking at the patch, I think I might misunderstood something - in the build method, for every category the right policy is checked, but the build itself is per CategoryListParam - so why cant the policy be the same for each CLP? If one wishes to get different policies etc - I think it would be logical to separate them to different clps, and this check should not be performed over each category?

        Show
        Gilad Barkai added a comment - Looking at the patch, I think I might misunderstood something - in the build method, for every category the right policy is checked, but the build itself is per CategoryListParam - so why cant the policy be the same for each CLP? If one wishes to get different policies etc - I think it would be logical to separate them to different clps, and this check should not be performed over each category?
        Hide
        Shai Erera added a comment -

        The thing is that there are two dimensions here: CategoryListParams and OrdinalPolicy for a category dimension:

        • Different CLPs are good for when an application has good reasons to group different categories into different category lists, and then at search time request different groups of facets. E.g. an eCommerce application will probably have different facets for Cameras and Shoes, and therefore it would make sense to separate the facets into different lists.
        • However, Mike and I saw that when you index hierarchical facets, then indexing them as NO_PARENTS yields better performance than ALL_PARENTS (b/c e.g. less ordinals are read), even when the parents' counts are rolled up in the end.
          • Having said that, we also experimented with separating dimensions to separate field (one field per dimension), but that yielded worse results than grouping them together.
          • So on one hand we want to have different OrdinalPolicy for different dimensions, but on the other hand, still group categories under the same CLP.

        When I started to work on that issue, I did just like as you suggest – use PerDimensionIndexingParams, and pass different CLP instances for different dimensions, yet still keep the CLP.field the same for dimensions that "should go together".

        But that complicated matters for FacetFields, b/c it first groups all CPs under their respective CLPs, and creates a Map<CategoryListParams,Iterable<CategoryPath>>. Then all the CPs of the same CLP are passed to CountingListBuilder.

        If I wanted to work w/ PerDimensionIndexingParams, I'd need to change FacetFields to map from a String -> (map of CLP -> CP) and then change CountingListBuilder accordingly. Also, CountingFacetsCollector would need to change as well, since currently it assumes a single CLP instance.

        In short, while this is doable, I think it's confusing, and could lead apps to think that if you need different OrdinalPolicy for dimensions, you also need different CLPs, and consequently different fields, which is bad!

        So I think that solution is good – whoever intends to control OrdinalPolicy, would need to create some Map, so with this approach, he'll create a Map(String,OrdinalPolicy). If he needs both worlds (multiple CLPs AND OrdinalPolicy-ies), then he needs to create two Maps ... doesn't sound a big deal to me.

        Show
        Shai Erera added a comment - The thing is that there are two dimensions here: CategoryListParams and OrdinalPolicy for a category dimension: Different CLPs are good for when an application has good reasons to group different categories into different category lists, and then at search time request different groups of facets. E.g. an eCommerce application will probably have different facets for Cameras and Shoes, and therefore it would make sense to separate the facets into different lists. However, Mike and I saw that when you index hierarchical facets, then indexing them as NO_PARENTS yields better performance than ALL_PARENTS (b/c e.g. less ordinals are read), even when the parents' counts are rolled up in the end. Having said that, we also experimented with separating dimensions to separate field (one field per dimension), but that yielded worse results than grouping them together. So on one hand we want to have different OrdinalPolicy for different dimensions, but on the other hand, still group categories under the same CLP. When I started to work on that issue, I did just like as you suggest – use PerDimensionIndexingParams, and pass different CLP instances for different dimensions, yet still keep the CLP.field the same for dimensions that "should go together". But that complicated matters for FacetFields, b/c it first groups all CPs under their respective CLPs, and creates a Map<CategoryListParams,Iterable<CategoryPath>> . Then all the CPs of the same CLP are passed to CountingListBuilder. If I wanted to work w/ PerDimensionIndexingParams, I'd need to change FacetFields to map from a String -> (map of CLP -> CP) and then change CountingListBuilder accordingly. Also, CountingFacetsCollector would need to change as well, since currently it assumes a single CLP instance. In short, while this is doable, I think it's confusing, and could lead apps to think that if you need different OrdinalPolicy for dimensions, you also need different CLPs, and consequently different fields, which is bad! So I think that solution is good – whoever intends to control OrdinalPolicy, would need to create some Map, so with this approach, he'll create a Map(String,OrdinalPolicy). If he needs both worlds (multiple CLPs AND OrdinalPolicy-ies), then he needs to create two Maps ... doesn't sound a big deal to me.
        Hide
        Shai Erera added a comment -

        Mike and I have been testing some aspects of this issue - we should test some others too and paste all the results here. Here are the scenarios:

        ALL_BUT_DIMENSION

        This should be better than ALL, since it encodes half the ordinals for flat dimensions.
        The test would be to index all flat dimensions with ALL (trunk) vs ALL_BUT (patch) and compare times.

        Per-Dimension Rollup

        This should be better for when you need to rollup counts for a small dimension (saves iterating on a large counts array).
        The test would be to:

        • Index all dimensions (flat + hierarchical), so the counts[] is big (2.5M entries)
          • Index Date in its own CLP in both cases, the idea is to generate a big taxonomy
        • Query with a FacetRequest Date/
        • Trunk would do the full traversal, patch would do the per-dim rollup and hopefully should be better

        Per-Dimension OrdinalPolicy

        The only advantage here is that it lets you index under the same CLP dimensions with different OrdinalPolicy settings.
        To compare, we'd need to index with trunk the dimensions as ALL or NO, vs patch which can mix between ALL and NO
        (we can discard ALL_BUT) for this test.

        Show
        Shai Erera added a comment - Mike and I have been testing some aspects of this issue - we should test some others too and paste all the results here. Here are the scenarios: ALL_BUT_DIMENSION This should be better than ALL, since it encodes half the ordinals for flat dimensions. The test would be to index all flat dimensions with ALL (trunk) vs ALL_BUT (patch) and compare times. Per-Dimension Rollup This should be better for when you need to rollup counts for a small dimension (saves iterating on a large counts array). The test would be to: Index all dimensions (flat + hierarchical), so the counts[] is big (2.5M entries) Index Date in its own CLP in both cases, the idea is to generate a big taxonomy Query with a FacetRequest Date/ Trunk would do the full traversal, patch would do the per-dim rollup and hopefully should be better Per-Dimension OrdinalPolicy The only advantage here is that it lets you index under the same CLP dimensions with different OrdinalPolicy settings. To compare, we'd need to index with trunk the dimensions as ALL or NO, vs patch which can mix between ALL and NO (we can discard ALL_BUT) for this test.
        Hide
        Michael McCandless added a comment -

        Here's the per-dim-rollup results. The index has 2 CLPs: one with only Date, the other with username+categories (= many ords, flat), and I facet only on Date CLP.

        base = trunk, comp = per-dim rollup

                            Task    QPS base      StdDev    QPS comp     StdDev                Pct diff
                        HighTerm       21.08      (6.9%)       21.26     (5.2%)    0.9% ( -10% -   13%)
                         MedTerm       50.06      (6.1%)       52.39     (4.4%)    4.6% (  -5% -   16%)
                         LowTerm       97.70      (4.7%)      110.47     (4.6%)   13.1% (   3% -   23%)
        

        So it helps most for queries matching fewer docs since the rollup is a fixed cost in the end ...

        Show
        Michael McCandless added a comment - Here's the per-dim-rollup results. The index has 2 CLPs: one with only Date, the other with username+categories (= many ords, flat), and I facet only on Date CLP. base = trunk, comp = per-dim rollup Task QPS base StdDev QPS comp StdDev Pct diff HighTerm 21.08 (6.9%) 21.26 (5.2%) 0.9% ( -10% - 13%) MedTerm 50.06 (6.1%) 52.39 (4.4%) 4.6% ( -5% - 16%) LowTerm 97.70 (4.7%) 110.47 (4.6%) 13.1% ( 3% - 23%) So it helps most for queries matching fewer docs since the rollup is a fixed cost in the end ...
        Hide
        Michael McCandless added a comment -

        base = ALL, comp = ALL_BUT; I disabled checking facet counts match in luceneutil because with ALL_BUT the dims have 0 count (expected) but with ALL they have the rolled up count. Also, I ran on the 5 non-huge flat dims in luceneutil:

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                        HighTerm       18.91      (1.4%)       21.32      (2.2%)   12.8% (   9% -   16%)
                         MedTerm       45.93      (1.2%)       54.08      (1.7%)   17.7% (  14% -   20%)
                         LowTerm      101.51      (0.5%)      124.79      (1.4%)   22.9% (  20% -   24%)
        
        Show
        Michael McCandless added a comment - base = ALL, comp = ALL_BUT; I disabled checking facet counts match in luceneutil because with ALL_BUT the dims have 0 count (expected) but with ALL they have the rolled up count. Also, I ran on the 5 non-huge flat dims in luceneutil: Task QPS base StdDev QPS comp StdDev Pct diff HighTerm 18.91 (1.4%) 21.32 (2.2%) 12.8% ( 9% - 16%) MedTerm 45.93 (1.2%) 54.08 (1.7%) 17.7% ( 14% - 20%) LowTerm 101.51 (0.5%) 124.79 (1.4%) 22.9% ( 20% - 24%)
        Hide
        Gilad Barkai added a comment -

        How can a mess be avoided when allowing different OrdinalPolicies in the same CLP ?
        There would be ordinals which has the parents, and ordinals that dont? How can the collector or aggregator know which ordinals should be delt with as without parents and which should not?

        Show
        Gilad Barkai added a comment - How can a mess be avoided when allowing different OrdinalPolicies in the same CLP ? There would be ordinals which has the parents, and ordinals that dont? How can the collector or aggregator know which ordinals should be delt with as without parents and which should not?
        Hide
        Shai Erera added a comment -

        First, look at the patch, there's a test for that .

        The way it works is that we now do per-dimension counts rollup. That is, say that you index the following dimensions under the same CLP: A (NO_PARENTS), B (ALL_PARENTS) and C (ALL_BUT_DIMENSION). When you ask to aggregate all of them then:

        • StandardFacetsCollector does not work with NO_PARENTS (not sure if it throws a hard exception now, I'll check). So your only choice is CountingFacetsCollector.
        • CountingFacetsCollector works as follows:
          • Aggregates in a FixedBitSet (one per segment) the matching documents.
          • It then traverses the counting list and counts all the ordinals that it finds.
          • Then when it computes the facet results, it goes per FacetRequest:
            • If the FR.categoryPath was indexed with NO_PARENTS ("A" in our case), it rolls up its ordinals only, not caring about the huge counts[]. See Mike's test above, this general improves the process by a bit.
            • Otherwise, there's no more rollup needed. "B" would have a count too, while "C" count will be 0, and only its children will be counted.

        Hope that explains it.

        Show
        Shai Erera added a comment - First, look at the patch, there's a test for that . The way it works is that we now do per-dimension counts rollup. That is, say that you index the following dimensions under the same CLP: A (NO_PARENTS), B (ALL_PARENTS) and C (ALL_BUT_DIMENSION). When you ask to aggregate all of them then: StandardFacetsCollector does not work with NO_PARENTS (not sure if it throws a hard exception now, I'll check). So your only choice is CountingFacetsCollector. CountingFacetsCollector works as follows: Aggregates in a FixedBitSet (one per segment) the matching documents. It then traverses the counting list and counts all the ordinals that it finds. Then when it computes the facet results, it goes per FacetRequest: If the FR.categoryPath was indexed with NO_PARENTS ("A" in our case), it rolls up its ordinals only, not caring about the huge counts[]. See Mike's test above, this general improves the process by a bit. Otherwise, there's no more rollup needed. "B" would have a count too, while "C" count will be 0, and only its children will be counted. Hope that explains it.
        Hide
        Shai Erera added a comment -

        Mike, those are great results (ALL vs ALL_BUT) ... amazing what a change of a flag can do .

        Show
        Shai Erera added a comment - Mike, those are great results (ALL vs ALL_BUT) ... amazing what a change of a flag can do .
        Hide
        Shai Erera added a comment -

        Added an assert to StandardFacetsCollector that none of the requested categories were indexed with NO_PARENTS. FacetsCollector.create() will fail if neither CountingFC nor StandardFC can handle the given search parameters.

        This is temporary until we finish with LUCENE-4610.

        Show
        Shai Erera added a comment - Added an assert to StandardFacetsCollector that none of the requested categories were indexed with NO_PARENTS. FacetsCollector.create() will fail if neither CountingFC nor StandardFC can handle the given search parameters. This is temporary until we finish with LUCENE-4610 .
        Hide
        Commit Tag Bot added a comment -

        [trunk commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1440416

        LUCENE-4715: Add OrdinalPolicy.ALL_BUT_DIMENSION

        Show
        Commit Tag Bot added a comment - [trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1440416 LUCENE-4715 : Add OrdinalPolicy.ALL_BUT_DIMENSION
        Hide
        Shai Erera added a comment -

        Committed to trunk and 4x. I don't think there's much value in the last test case (measuring the impact of allowing per-dimension OrdinalPolicy).

        Show
        Shai Erera added a comment - Committed to trunk and 4x. I don't think there's much value in the last test case (measuring the impact of allowing per-dimension OrdinalPolicy).
        Hide
        Commit Tag Bot added a comment -

        [branch_4x commit] Shai Erera
        http://svn.apache.org/viewvc?view=revision&revision=1440419

        LUCENE-4715: Add OrdinalPolicy.ALL_BUT_DIMENSION

        Show
        Commit Tag Bot added a comment - [branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1440419 LUCENE-4715 : Add OrdinalPolicy.ALL_BUT_DIMENSION
        Hide
        Uwe Schindler added a comment -

        Closed after release.

        Show
        Uwe Schindler added a comment - Closed after release.

          People

          • Assignee:
            Shai Erera
            Reporter:
            Shai Erera
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development