Lucene - Core
LUCENE-4795

Add FacetsCollector based on SortedSetDocValues

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3, 5.0
    • Component/s: modules/facet
    • Labels: None
    • Lucene Fields: New

      Description

      Recently (LUCENE-4765) we added a multi-valued DocValues field
      (SortedSetDocValuesField), and it can be used for faceting in Solr
      (SOLR-4490). I think we should also add support in the facet module?

      It'd be an option with different tradeoffs. E.g., it wouldn't require
      the taxonomy index, since the main index handles label/ord resolving.

      There are at least two possible approaches:

      • On every reopen, build the seg -> global ord map, and then on
        every collect, get the seg ord, map it to the global ord space,
        and increment counts. This adds cost during reopen in proportion
        to number of unique terms ...
      • On every collect, increment counts based on the seg ords, and then
        do a "merge" in the end just like distributed faceting does.
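      The first approach can be sketched in plain Java. This is a toy
      model, not Lucene's actual API: the class and method names
      (GlobalOrdSketch, buildOrdMaps, collect) are hypothetical, and in
      real code SortedSetDocValues would supply the per-segment sorted
      term dictionaries and ords.

```java
import java.util.*;

// Toy model of approach #1 (hypothetical names, not Lucene's API):
// on reopen, merge the per-segment sorted term dictionaries into one
// global sorted term list and record a segOrd -> globalOrd map per
// segment; on collect, map each segment ord and bump a global count.
class GlobalOrdSketch {

    // Reopen-time work, proportional to the number of unique terms.
    static int[][] buildOrdMaps(List<String[]> segTerms, List<String> globalTerms) {
        TreeSet<String> union = new TreeSet<>();
        for (String[] terms : segTerms) union.addAll(Arrays.asList(terms));
        globalTerms.addAll(union);  // TreeSet iterates in sorted order
        Map<String, Integer> globalOrd = new HashMap<>();
        for (int i = 0; i < globalTerms.size(); i++) globalOrd.put(globalTerms.get(i), i);
        int[][] maps = new int[segTerms.size()][];
        for (int seg = 0; seg < segTerms.size(); seg++) {
            String[] terms = segTerms.get(seg);
            maps[seg] = new int[terms.length];
            for (int i = 0; i < terms.length; i++) maps[seg][i] = globalOrd.get(terms[i]);
        }
        return maps;
    }

    // Per-hit work: one ord-map lookup per value of the doc.
    static void collect(int[] counts, int[] segToGlobal, int... docSegOrds) {
        for (int segOrd : docSegOrds) counts[segToGlobal[segOrd]]++;
    }
}
```

      The reopen cost lives entirely in buildOrdMaps; the per-collect
      cost is a single array lookup per value.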

      The first approach is much easier so I built a quick prototype using
      that. The prototype does the counting, but it does NOT do the top K
      facets gathering in the end, and it doesn't "know" parent/child ord
      relationships, so there's tons more to do before this is real. I also
      was unsure how to properly integrate it since the existing classes
      seem to expect that you use a taxonomy index to resolve ords.

      I ran a quick performance test. base = trunk except I disabled the
      "compute top-K" in FacetsAccumulator to make the comparison fair; comp
      = using the prototype collector in the patch:

                          Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                     OrHighLow       18.79      (2.5%)       14.36      (3.3%)  -23.6% ( -28% -  -18%)
                      HighTerm       21.58      (2.4%)       16.53      (3.7%)  -23.4% ( -28% -  -17%)
                     OrHighMed       18.20      (2.5%)       13.99      (3.3%)  -23.2% ( -28% -  -17%)
                       Prefix3       14.37      (1.5%)       11.62      (3.5%)  -19.1% ( -23% -  -14%)
                       LowTerm      130.80      (1.6%)      106.95      (2.4%)  -18.2% ( -21% -  -14%)
                    OrHighHigh        9.60      (2.6%)        7.88      (3.5%)  -17.9% ( -23% -  -12%)
                   AndHighHigh       24.61      (0.7%)       20.74      (1.9%)  -15.7% ( -18% -  -13%)
                        Fuzzy1       49.40      (2.5%)       43.48      (1.9%)  -12.0% ( -15% -   -7%)
               MedSloppyPhrase       27.06      (1.6%)       23.95      (2.3%)  -11.5% ( -15% -   -7%)
                       MedTerm       51.43      (2.0%)       46.21      (2.7%)  -10.2% ( -14% -   -5%)
                        IntNRQ        4.02      (1.6%)        3.63      (4.0%)   -9.7% ( -15% -   -4%)
                      Wildcard       29.14      (1.5%)       26.46      (2.5%)   -9.2% ( -13% -   -5%)
              HighSloppyPhrase        0.92      (4.5%)        0.87      (5.8%)   -5.4% ( -15% -    5%)
                   MedSpanNear       29.51      (2.5%)       27.94      (2.2%)   -5.3% (  -9% -    0%)
                  HighSpanNear        3.55      (2.4%)        3.38      (2.0%)   -4.9% (  -9% -    0%)
                    AndHighMed      108.34      (0.9%)      104.55      (1.1%)   -3.5% (  -5% -   -1%)
               LowSloppyPhrase       20.50      (2.0%)       20.09      (4.2%)   -2.0% (  -8% -    4%)
                     LowPhrase       21.60      (6.0%)       21.26      (5.1%)   -1.6% ( -11% -   10%)
                        Fuzzy2       53.16      (3.9%)       52.40      (2.7%)   -1.4% (  -7% -    5%)
                   LowSpanNear        8.42      (3.2%)        8.45      (3.0%)    0.3% (  -5% -    6%)
                       Respell       45.17      (4.3%)       45.38      (4.4%)    0.5% (  -7% -    9%)
                     MedPhrase      113.93      (5.8%)      115.02      (4.9%)    1.0% (  -9% -   12%)
                    AndHighLow      596.42      (2.5%)      617.12      (2.8%)    3.5% (  -1% -    8%)
                    HighPhrase       17.30     (10.5%)       18.36      (9.1%)    6.2% ( -12% -   28%)
      

      I'm impressed that this approach is only ~24% slower in the worst
      case! I think this means it's a good option to make available? Yes
      it has downsides (NRT reopen more costly, small added RAM usage,
      slightly slower faceting), but it's also simpler (no taxo index to
      manage).

      1. LUCENE-4795.patch
        48 kB
        Michael McCandless
      2. LUCENE-4795.patch
        32 kB
        Michael McCandless
      3. LUCENE-4795.patch
        27 kB
        Michael McCandless
      4. LUCENE-4795.patch
        24 kB
        Michael McCandless
      5. LUCENE-4795.patch
        4 kB
        Michael McCandless
      6. LUCENE-4795.patch
        4 kB
        Michael McCandless
      7. pleaseBenchmarkMe.patch
        1 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        Thanks for benchmarking this approach Mike!

        I'm happy with the results, though I still added a TODO that we should investigate the cost of the special packed-ints compression we do.

        Can you benchmark the attached change, just out of curiosity?

        Adrien Grand added a comment -

        Not having to manage a taxonomy index is very appealing to me!

        What about collecting based on segment ords and bulk translating these ords to the global ords in setNextReader and when the collection ends? This way ordinalMap.get would be called less often (once per value per segment instead of once per value per doc per segment) and in a sequential way so I assume it would be faster while remaining easy to implement?
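        A minimal sketch of that variant, with hypothetical names (MergeBySegSketch is not module code): the hot collect loop touches only per-segment counts, and the ord map is walked once, sequentially, at each segment boundary.

```java
// Toy model of bulk-translating segment ords (hypothetical names):
// counts accumulate against raw segment ords during collection and
// are folded into the global ord space once per segment.
class MergeBySegSketch {
    final int[] globalCounts;
    private int[] segCounts;    // indexed by segment ord
    private int[] segToGlobal;  // current segment's ord map

    MergeBySegSketch(int numGlobalOrds) { globalCounts = new int[numGlobalOrds]; }

    // Analogous to Collector.setNextReader: first flush the previous
    // segment with one sequential pass over its ord map.
    void setNextSegment(int[] ordMap) {
        finish();
        segToGlobal = ordMap;
        segCounts = new int[ordMap.length];
    }

    void collect(int segOrd) { segCounts[segOrd]++; }  // no ord-map lookup here

    void finish() {  // call once after the last segment
        if (segCounts == null) return;
        for (int i = 0; i < segCounts.length; i++)
            if (segCounts[i] != 0) globalCounts[segToGlobal[i]] += segCounts[i];
        segCounts = null;
    }
}
```

        The ord map is now read once per unique value per segment instead of once per value per doc, and in ascending order, which should be friendlier to the packed-ints structures backing it.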

        Michael McCandless added a comment -

        What about collecting based on segment ords and bulk translating these ords to the global ords in setNextReader and when the collection ends?

        That sounds great! I'll try that.

        Shai Erera added a comment -

        Nice Mike.

        If you want to integrate that with the current classes, all you need to do is implement a partial TaxonomyReader which resolves ordinals to CPs using the global ord map? Or actually make that TR the entity that's responsible for managing the global ordinal map, so that TR.doOpenIfChanged opens the new segments and updates the global map?

        Since this taxonomy, at least currently, doesn't support hierarchical facets, you'll need to hack something like a ParallelTaxoArray, but that should be easy ... I think.

        Is the only benefit in this approach that you don't need to manage a sidecar taxonomy index?

        Michael McCandless added a comment -

        Fixed small bug (wasn't counting ord 0); here's the same test as
        before, just running on Term & Or queries:

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                         MedTerm       52.70      (3.1%)       40.87      (2.0%)  -22.5% ( -26% -  -17%)
                       OrHighMed       25.54      (3.6%)       20.18      (2.4%)  -21.0% ( -26% -  -15%)
                        HighTerm        9.22      (4.1%)        7.33      (2.4%)  -20.4% ( -25% -  -14%)
                      OrHighHigh       12.92      (3.6%)       10.41      (2.8%)  -19.4% ( -24% -  -13%)
                       OrHighLow       13.12      (3.8%)       10.61      (2.8%)  -19.2% ( -24% -  -13%)
                         LowTerm      145.94      (1.9%)      125.51      (1.6%)  -14.0% ( -17% -  -10%)
        

        Then I applied Rob's patch (base = trunk, comp = Rob's + my patch):

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                         MedTerm       52.97      (2.2%)       42.34      (1.6%)  -20.1% ( -23% -  -16%)
                       OrHighMed       25.66      (2.2%)       20.73      (1.7%)  -19.2% ( -22% -  -15%)
                      OrHighHigh       12.99      (2.4%)       10.69      (1.8%)  -17.7% ( -21% -  -13%)
                       OrHighLow       13.19      (2.3%)       10.94      (2.0%)  -17.0% ( -20% -  -12%)
                        HighTerm        9.30      (2.6%)        7.76      (1.8%)  -16.6% ( -20% -  -12%)
                         LowTerm      146.48      (1.3%)      129.04      (0.9%)  -11.9% ( -13% -   -9%)
        

        So a wee bit faster but not much... (good! The awesome predictive
        compression from MonotonicALB doesn't hurt much).

        Then I made a new collector that resolves ords after each segment from
        Adrien's idea (SortedSetDocValuesCollectorMergeBySeg) – base = same
        as above, comp = new collector w/o Rob's patch:

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                        HighTerm        9.29      (3.1%)        7.14      (1.9%)  -23.2% ( -27% -  -18%)
                       OrHighMed       25.51      (2.7%)       19.60      (2.2%)  -23.2% ( -27% -  -18%)
                       OrHighLow       13.08      (2.8%)       10.20      (2.3%)  -22.0% ( -26% -  -17%)
                      OrHighHigh       12.89      (2.9%)       10.21      (2.6%)  -20.8% ( -25% -  -15%)
                         MedTerm       53.00      (2.7%)       43.34      (1.5%)  -18.2% ( -21% -  -14%)
                         LowTerm      145.97      (1.6%)      133.05      (0.9%)   -8.9% ( -11% -   -6%)
        

        Strangely it's not really faster ... maybe I have a bug.
        Unfortunately, until we get the top K working, we can't do the
        end-to-end comparison to make sure we're getting the right facet
        values ...

        Robert Muir added a comment -

        Thanks for benchmarking: I think we should keep the monotonic compression!
        It will use significantly less RAM for this thing.

        Michael McCandless added a comment -

        If you want to integrate that with the current classes, all you need to do is implement a partial TaxonomyReader which resolves ordinals to CPs using the global ord map? Or actually make that TR the entity that's responsible for managing the global ordinal map, so that TR.doOpenIfChanged opens the new segments and updates the global map?

        That sounds great!

        Since this taxonomy, at least currently, doesn't support hierarchical facets, you'll need to hack something like a ParallelTaxoArray, but that should be easy ... I think.

        OK.

        I think it could be hierarchical w/o much work, i.e. on reopen, as it
        walks the terms it should be able to easily build up the parent/child
        arrays since the terms are in sorted order. Hmm, except, with SSDV
        you cannot have a term/ord that had no docs indexed, so the
        "ancestor" ords would not exist ... hmm. Better to start
        non-hierarchical.
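        The missing-ancestor problem can be made concrete with a small
        sketch (hypothetical code, not the module's): derive each path's
        parent ord from the sorted term list, and any ancestor path that
        no document ever carried simply has no ord.

```java
import java.util.*;

// Toy model (hypothetical names): a path's parent is its longest
// proper prefix ending before a '/'. With SSDV, only terms some doc
// actually carried get ords, so a parent path may be missing entirely.
class ParentOrdSketch {

    // Returns parentOrd[i] for each sorted path, or -1 when the
    // ancestor path was never indexed.
    static int[] parentOrds(List<String> sortedPaths) {
        Map<String, Integer> ord = new HashMap<>();
        for (int i = 0; i < sortedPaths.size(); i++) ord.put(sortedPaths.get(i), i);
        int[] parents = new int[sortedPaths.size()];
        for (int i = 0; i < sortedPaths.size(); i++) {
            String p = sortedPaths.get(i);
            int slash = p.lastIndexOf('/');
            parents[i] = slash < 0 ? -1 : ord.getOrDefault(p.substring(0, slash), -1);
        }
        return parents;
    }
}
```

        With paths "a", "a/b", "a/b/c" the chain resolves fine, but a
        path like "x/y" whose parent "x" was never indexed gets no
        parent ord, which is exactly why starting non-hierarchical is
        simpler.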

        I guess if we are non-hierarchical then we don't really need to
        integrate at indexing time? I.e., the app can just add the facet
        values using SortedSetDVF.

        Is the only benefit in this approach that you don't need to manage a sidecar taxonomy index?

        I think so?

        Michael McCandless added a comment -

        New patch ... I think it's close but there are still some nocommits.

        I switched to a FacetsAccumulator (SortedSetDVAccumulator) instead of
        XXXCollector because:

        • It's more fair since it now does all counting "in the end",
          matching trunk, which was a bit faster than count-as-you-go when
          we last tested.
        • It means you can use this class with DrillSideways ... I fixed
          TestDrillSideways to test it (passes!).

        I also got a custom topK impl working.

        The facets are the same as trunk, except for tie-break differences.
        The new collector is better in this regard: it breaks ties in an
        understandable-to-the-end-user way (by ord = Unicode sort order),
        unlike the taxo index which is "order in which label was indexed into
        taxo index" (confusing to end user).

        I first went down the road of making a TaxoReader that wraps a
        SlowCompositeReaderWrapper ... but this became problematic because a
        DV instance is not thread-safe, yet TaxoReader's APIs are supposed to
        be thread-safe. I also really didn't like making 3 int[maxOrd] to
        handle "hierarchy" when SortedSetDV facets only support a two-level
        hierarchy (dim + child).

        So I backed off of that and made a separate State object, which you
        must re-init after every top-reader reopen, and it does the heavyish
        stuff.

        Current results (base = trunk w/ allbutdim, comp = patch, full wikibig
        index with 5 flat dims):

                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                        HighTerm        9.36      (1.9%)        7.02      (3.6%)  -25.0% ( -29% -  -19%)
                         MedTerm       53.21      (1.5%)       40.65      (2.8%)  -23.6% ( -27% -  -19%)
                       OrHighLow       13.25      (2.1%)       10.55      (3.4%)  -20.4% ( -25% -  -15%)
                       OrHighMed       25.77      (1.9%)       20.90      (3.1%)  -18.9% ( -23% -  -14%)
                      OrHighHigh       13.03      (2.2%)       10.63      (3.2%)  -18.4% ( -23% -  -13%)
                         LowTerm      146.28      (1.7%)      120.22      (1.7%)  -17.8% ( -20% -  -14%)
        
        Michael McCandless added a comment -

        New patch, just cleaning everything up ... I think it's ready.

        I know this approach isn't perfect (it has its own topK code), but I
        think it's a good enough start that we should get it in and make it
        available. We can iterate over time ...

        Robert Muir added a comment -

        I'm not sure I understand the dim/value stuff going on inside the single DV field.

        Wouldn't it be more natural to just use multiple Lucene fields?

        Michael McCandless added a comment -

        Wouldn't it be more natural to just use multiple Lucene fields?

        Yeah ... maybe. I agree this is a weird thing about the facet module,
        i.e. that we don't have a FacetField, and you add one per "dimension",
        like new FacetField("category", ...), new FacetField("price", ...).

        Instead you have any number of CategoryPaths, like
        /price/1-100/20-30/25.99 and /category/Fiction/..., and those are all
        indexed into a single $facet binary DV field (by default), using a
        single call to FacetFields.addFields.
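        The dim/value packing can be illustrated with a toy encoding (the
        separator and class/method names here are illustrative, not the
        module's actual byte-level format): each dimension + label pair
        becomes one term, so all of a dimension's labels sort contiguously
        in the single field's term dictionary.

```java
import java.util.*;

// Toy model of packing "dimension + label" into one term of a single
// field (illustrative '\u001f' separator; not the facet module's
// real encoding).
class DimEncodingSketch {
    static final char SEP = '\u001f';

    static String encode(String dim, String label) { return dim + SEP + label; }

    // Count the labels of one dimension from a flat multiset of
    // encoded values (a stand-in for walking one dim's ord range).
    static Map<String, Integer> countDim(List<String> values, String dim) {
        String prefix = dim + SEP;
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values)
            if (v.startsWith(prefix))
                counts.merge(v.substring(prefix.length()), 1, Integer::sum);
        return counts;
    }
}
```

        Because the separator sorts below printable characters, each
        dimension occupies one contiguous ord range, which is what lets a
        single sorted-set field serve many dimensions at once.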

        We could maybe generalize this new FacetsAccumulator to allow for
        "single DV field has single facet dim" approach ... but it's a break
        from how the rest of facets work, and I also suspect there's a
        non-trivial performance hit to it (haven't yet tested).

        I think we should explore that later? The current patch at least
        makes faceting in the "facet module" approach possible...

        Robert Muir added a comment -

        I agree: and actually, rethinking it, I think it's healthier to keep as much of the facet module's existing approach as we can (minus the sidecar index for this case).

        We can always add other approaches that work on each field like Solr/Elasticsearch. I'd rather have more choices out there.

        Shai Erera added a comment -

        I also suspect there's a non-trivial performance hit to it (haven't yet tested)

        Mike, we did experiment with that during the overall refactoring/optimization work. I was able to dig up the results – we tried to simulate "column-stride ordinals" by indexing different dimensions into different DV fields, using different taxoWriters. I created a SingleValuedFacetField (over the old DV, using VAR_INTS) and indexed the ordinal of the CP there. So I think that kind of simulates SortedSet when there's only one value?

        The results were not good:

                           Per Dim    Trunk   Diff (%)
        AndHighHigh         64.35     89.41    38.94%
        AndHighLow         358.7    1033.32   188.07%
        AndHighMed         181.22    318.36    75.68%
        Fuzzy1              78.49    116.99    49.05%
        Fuzzy2              72.79    104.12    43.04%
        HighPhrase          36.83     43.67    18.57%
        HighSloppyPhrase     3.12      3.25     4.17%
        HighSpanNear        10.76     11.74     9.11%
        HighTerm            40.23     65.07    61.74%
        IntNRQ             109.46    114.52     4.62%
        LowPhrase           60.63     72.59    19.73%
        LowSloppyPhrase     66.01     82.9     25.59%
        LowSpanNear         38.5      43.81    13.79%
        LowTerm            187.58     378.86  101.97%
        MedPhrase          103.57     153.26   47.98%
        MedSloppyPhrase     70.86     89.78    26.70%
        MedSpanNear         69.29     89.2     28.73%
        MedTerm            120.72    237.39    96.65%
        OrHighHigh          26.41     38.55    45.97%
        OrHighLow           36.6      55.38    51.31%
        OrHighMed           37.62     56.81    51.01%
        PKLookup           250.59    301.5     20.32%
        Prefix3            170.76    176.48     3.35%
        Respell            101.4     136.58    34.69%
        Wildcard           225.08    252.69    12.27%
        

        Trunk was trunk at the time, and per-dim is the new FacetField. We suspected that that's because when you index different dimensions into different DV fields, you need to go over the result docs X times (where X is the number of dimensions you asked to count; we tested 9).

        I think it's healthier to keep as much of the facet module's existing approach

        Do you have an idea how one can use the new field to index weighted facets? I.e., category/Computers=0.9 and category/Science=0.84? If the only way is to use BinaryDV, is there a way we can use the ordinals from SortedSet to put in a BDV 1=0.9, 2=0.84? Hmm, but since those ords are per-segment, we'd need to rewrite the BDV upon merge, right?

        I ask because it seems that the only thing that we get from this SortedSet approach is not having to maintain a sidecar index (which for some reason freaks everybody), and we even lose performance. Plus, I don't see how we can support other facet features with it. So perhaps we should focus on how to use the search index to build a taxonomy? Maybe it's all in-memory, that's fine. If we manage to support on-disk lookups too, even better. But if we do that, then we should have no problems supporting all current facet features, because all that the taxonomy index gives us is a global-ordinal (plus hierarchy management, but I think we can do that w/ SortedSet too). We can of course explore that in a different issue.

        Shai Erera added a comment -

        About the patch: Mike, why do you need to initialize a FacetRequest like so: requests.add(new CountFacetRequest(new CategoryPath("a", sep), 10));? Why do you need to do "a/" and not just "a"? It's not like how requests are initialized today, right?

        Robert Muir added a comment -

        I ask because it seems that the only thing that we get from this SortedSet approach is not having to maintain a sidecar index (which for some reason freaks everybody), and we even lose performance. Plus, I don't see how we can support other facet features with it. So perhaps we should focus on how to use the search index to build a taxonomy? Maybe it's all in-memory, that's fine. If we manage to support on-disk lookups too, even better. But if we do that, then we should have no problems supporting all current facet features, because all that the taxonomy index gives us is a global-ordinal (plus hierarchy management, but I think we can do that w/ SortedSet too). We can of course explore that in a different issue.

        Well, the taxonomy index doesn't give you global ordinals. It gives you global "termIDs", which are unique integers, but they aren't ordinals: their sort order is meaningless. This creates additional trouble if you want to integrate the current Lucene facet module with e.g. Solr, which has faceting options that rely on these properties.

        It's also unclear to me how the taxonomy index would really integrate into a distributed system like Solr or Elasticsearch. I know there has been discussion about it before, and I'm sure there are solutions, but it just seems fairly complicated.

        On the other hand, SortedSet doesn't have these problems. Maybe it doesn't support weighted facets or other features, but it's a nice option. I personally don't think it's the end of the world if Mike's patch doesn't support all the features of the faceting module initially, or even ever.

        The idea is just to have more choices. I'm not saying you should get rid of the taxonomy index: just provide options. I don't think Lucene's faceting support needs to be limited to a single one-size-fits-all solution; it can instead have a few options with different tradeoffs. Compare with something like the suggest module, which has five or six implementations.

        Shai Erera added a comment -

        Well, the taxonomy index doesn't give you global ordinals. It gives you global "termIDs", which are unique integers, but they aren't ordinals

        That's right. I am not familiar with how Solr utilizes that, but I agree with your statement. The term "ordinal" was derived from the fact that the taxonomy does preserve order between parent and children, i.e. Date < Date/2010 <> Date/2011. So Date will always have a lower ordinal than its children, but there is no meaningful order between siblings.

        It's also unclear to me how the taxonomy index would really integrate into a distributed system like Solr or Elasticsearch.

        Why? We work with the taxonomy index in two modes in a distributed environment:

        1. Every shard maintains its own taxonomy index and facets are merged by their label. That's basically what Solr/ES/SortedSet would do right?
        2. In a specific project we run, where every document goes through a MapReduce analysis (no NRT!), we maintain a truly global taxonomy index, where ordinal=17 means the same category in all shards. The taxonomy index itself is replicated to all shards. There are tradeoffs of course, but you cannot do that with SortedSet right? The advantage is that you can do the merge by the ordinal (integer ID), rather than the label.
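Mode 1 above (each shard counts locally and the coordinator merges facets by their label) can be sketched in a few lines of plain Java; the per-shard result shape here is a hypothetical simplification:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy sketch of merge-by-label distributed faceting (mode 1 above). */
public class MergeByLabelSketch {
    static Map<String, Long> merge(List<Map<String, Long>> shardResults) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> shard : shardResults) {
            // the same label from different shards sums into one global count
            shard.forEach((label, count) -> merged.merge(label, count, Long::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> s1 = Map.of("Date/2010", 3L, "Date/2011", 1L);
        Map<String, Long> s2 = Map.of("Date/2010", 2L);
        System.out.println(merge(List.of(s1, s2)).get("Date/2010")); // 5
    }
}
```

Mode 2 would instead merge on integer ordinals, which only works when all shards share one truly global taxonomy.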

        I personally don't think it's the end of the world if Mike's patch doesn't support all the features of the faceting module initially, or even ever.

        +1, I'm not criticizing that approach. I personally don't understand why the sidecar taxonomy index freaks the hell out of people, but I don't mind if there are multiple facet implementations. I can share that we used to have a few implementations too, before we converged on one (and then contributed it to Lucene).

        You didn't answer my question though, and perhaps it doesn't belong in this issue, but is there a way to utilize the ordinal given to a DV value somehow? Or is it internal to the SortedSet DV?

        Mike, should you also check in SortedSetDocValuesAccumulator that FR.getDepth() == 1? I don't think that you support counting up to depth N, right?

        Michael McCandless added a comment -

        Mike, why do you need to initialize a FacetRequest like so: requests.add(new CountFacetRequest(new CategoryPath("a", sep), 10));?

        Woops, that's just silly: I'll remove the sep there.

        Mike, should you also check in SortedSetDocValuesAccumulator that FR.getDepth() == 1? I don't think that you support counting up to depth N, right?

        Right, it only supports flat (dim / label) today ... ok, I'll add that
        check.

        Shai Erera added a comment -

        Thanks. Also (sorry that it comes in parts), I find this confusing: new SortedSetDocValuesField("myfacets", new BytesRef("a" + sep + "foo")). The user needs to decide under which field all facets will be indexed. This could lead users to do new SSDVF("author", new BytesRef("shai")) and new SSDVF("date", new BytesRef("2010/March/13")). We know, from past results, that this will result in worse search performance. Also, this doesn't take a CP, which is inconsistent with e.g. FacetRequest, where you need to pass one. So perhaps we should instead:

        • Add a FacetField (extends SSDVF) which takes a CP (potentially FacetIndexingParams as well).
        • It will call super(CLP.DEFAULT_FIELD, new BytesRef(cp.toString())) (we can optimize that later, e.g. have CP expose a BytesRef API too if we want).
        • Potentially, allow (or not) to define the field type.

        What do you think?
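A hedged, self-contained sketch of the proposed FacetField: in the real patch it would extend SortedSetDocValuesField and take a CategoryPath (and possibly FacetIndexingParams), but the encoding idea is just this. The field name, delimiter, and class names below are illustrative stand-ins:

```java
import java.nio.charset.StandardCharsets;

/** Sketch of a facet field that picks the indexed field name and encodes
 *  dim + delimiter + label itself, so apps don't create one DV field per
 *  dimension by mistake. Not the actual Lucene classes. */
public class FacetFieldSketch {
    static final String DEFAULT_FIELD = "$facets";  // stand-in for CLP.DEFAULT_FIELD
    static final char DELIM = '\u001F';             // stand-in for the FIP delimiter

    final String field;
    final byte[] bytes;  // what would become the SSDVF's BytesRef

    FacetFieldSketch(String... categoryPath) {
        if (categoryPath.length != 2) {
            // flat dim/label only; hierarchy needs the taxonomy index
            throw new IllegalArgumentException(
                "expected dimension + label, got " + categoryPath.length + " components");
        }
        this.field = DEFAULT_FIELD;
        this.bytes = (categoryPath[0] + DELIM + categoryPath[1])
            .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        FacetFieldSketch f = new FacetFieldSketch("author", "shai");
        System.out.println(f.field + " / " + f.bytes.length); // $facets / 11
    }
}
```

The length check also corresponds to the index-time rejection of hierarchical paths discussed further down the thread.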

        Shai Erera added a comment -

        Another thing, in the accumulator you do q.insertWithOverflow(new FacetResultNode(ord, counts[ord])). Any reason why you don't get a hold of the returned FRN? Or pre-fill the queue with sentinel objects?

        In ReaderState you can call SlowComposite.wrap instead of checking if (reader instanceof AtomicReader), which is essentially the same thing, just fewer lines of code.
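The reuse pattern being suggested can be illustrated with a plain java.util.PriorityQueue standing in for Lucene's (method and class names are illustrative): hold on to the node that insertWithOverflow hands back and refill it, so a top-K pass allocates at most K+1 node objects instead of one per candidate.

```java
import java.util.PriorityQueue;

/** Toy illustration of reusing the node returned by insertWithOverflow. */
public class ReuseNodeSketch {
    static final class Node {
        int ord;
        long value;
        Node(int ord, long value) { this.ord = ord; this.value = value; }
    }

    /** Like Lucene's insertWithOverflow: once the queue is full, returns the
     *  evicted or rejected node (both are safe to reuse) instead of null. */
    static Node insertWithOverflow(PriorityQueue<Node> q, int k, Node n) {
        if (q.size() < k) {
            q.add(n);
            return null;
        }
        if (n.value <= q.peek().value) {
            return n;                 // didn't make the cut: caller reuses it
        }
        Node bottom = q.poll();       // evicted: caller reuses it
        q.add(n);
        return bottom;
    }

    /** Returns the ords of the k highest counts (unordered). */
    static int[] topOrds(long[] counts, int k) {
        PriorityQueue<Node> q = new PriorityQueue<>((a, b) -> Long.compare(a.value, b.value));
        Node spare = null;
        for (int ord = 0; ord < counts.length; ord++) {
            if (spare == null) {
                spare = new Node(ord, counts[ord]);
            } else {
                spare.ord = ord;      // reuse instead of allocating a new node
                spare.value = counts[ord];
            }
            spare = insertWithOverflow(q, k, spare);
        }
        int[] ords = new int[q.size()];
        for (int i = 0; i < ords.length; i++) ords[i] = q.poll().ord;
        return ords;
    }

    public static void main(String[] args) {
        int[] top = topOrds(new long[] {5, 1, 9, 3}, 2);
        java.util.Arrays.sort(top);
        System.out.println(java.util.Arrays.toString(top)); // [0, 2]
    }
}
```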

        Robert Muir added a comment -

        You didn't answer my question though, and perhaps it doesn't belong in this issue, but is there a way to utilize the ordinal given to a DV value somehow? Or is it internal to the SortedSet DV?

        Because I don't want to encourage crazy software designs to support fringe features. Want weighted faceting? Use the tax index: pretty simple.

        Michael McCandless added a comment -

        So rather perhaps we should:

        • Add a FacetField (extends SSDVF) which takes a CP (potentially FacetIndexingParams as well).
        • It will call super(CLP.DEFAULT_FIELD, new BytesRef(cp.toString())) (we can optimize that later, e.g. have CP expose a BytesRef API too if we want).
        • Potentially, allow (or not) to define the field type.

        I agree it's awkward now.

        But ... FacetField makes me nervous, just because it's too close to
        FacetFields and users may think they can mix & match the two
        approaches. It's trappy ... maybe SortedSetDocValuesFacetField
        instead?

        But you'd need to provide it with this separator... hmm, or maybe we
        can use the same sep as FIP.

        Separately, I wonder whether facet module should escape the delimiter
        when it appears in a cat path label, in general (and, here)? This way
        the app does not have to ensure it never appears in any label (which I
        think is tricky for some apps to do, eg a search server like
        ElasticSearch/Solr can't do this).

        Any reason why you don't get a hold of the returned FRN?

        I wanted to keep it simple for starters ... but I'll fix to reuse the
        rejected entry.

        Shai Erera added a comment -

        maybe SortedSetDocValuesFacetField instead?

        +1.

        There's also another thing that bothers me – nothing prevents the app from creating a SSDVField with hierarchical facets now. So while your tests don't do it, I could easily create such a category (Date/2010/March/21), and then what would happen?

        By having SSDVFacetField you can control that by yelling at whoever attempts to do that. For that, taking a CP (see comment below) would make the check easier (if cp.length > 2, it's an error).

        But you'd need to provide it with this separator... hmm, or maybe we can use the same sep as FIP.

        I think that you should give it a CP, not a BytesRef. There is a difference between the separator you give to CP(str, '/') and the one that's defined in FIP. The one that you give to CP (and you can use the vararg constructor to avoid that) just tells CP how to break the string into the path components. For instance, if we removed that ctor, you'd need to call the vararg one by splitting the string yourself.

        The one in FIP is used to write the drill-down terms. Currently, following your recent change, it's set to a character that is really not expected to be part of a category string. Therefore I don't worry about it.

        As for how the app can set the separator in a server-based environment, there are two options. One is to define it as part of the index schema – Solr/ES shouldn't care about it, they would just use it to split category strings, but the app should know what's a safe separator for it. Another option, which we chose to implement since we work with JSON input, is to define the facets as an array of strings; then there's no separator worry for the server. The app should still be able to construct the array, meaning it should be able to split the category strings on a 'safe' separator. I don't think we should worry about escaping delimiters. If an app cannot be sure that e.g. '>' is a safe separator, it should escape it itself.
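The distinction between the two separators can be shown in plain Java (the '/' and U+001F characters mirror the defaults under discussion; names are illustrative):

```java
public class SeparatorSketch {
    public static void main(String[] args) {
        // The separator passed to CP(str, '/') only splits the *input* string
        // into path components; it is never stored in the index.
        String[] components = "Date/2010/March".split("/");

        // The FIP delimiter joins components into the indexed drill-down
        // term; it is chosen to be a char unlikely to appear in labels.
        char fipDelim = '\u001F';
        String drillDownTerm = String.join(String.valueOf(fipDelim), components);

        System.out.println(components.length);                    // 3
        System.out.println(drillDownTerm.indexOf(fipDelim) >= 0); // true
    }
}
```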

        Michael McCandless added a comment -

        By having SSDVFacetField you can control that by yelling at whoever attempts to do that.

        Good! I'll add that index-time check ...

        There is a difference between the separator you give to CP(str, '/') to the one that's defined in FIP.

        I'm talking about FIP's/Consts.java separator, that's baked into the index.

        I think hidden separators are trappy, and while U+001F is rather unlikely to appear, I still think it's bad to disallow it in the input. Maybe an app wants to hide some payload in its facet labels, so it joins two things with U+001F...

        Making users of server apps (Solr/ElasticSearch) specify this in their schema is still trappy...

        Also, the cost of doing this escaping / unescaping would be minor.

        Anyway, this is a separate issue ... I'll open a new jira.

        Shai Erera added a comment -

        From more than a dozen faceted search applications I've worked with, none had any issue with the separator. Therefore I think it's overly defensive programming for Lucene to do it. If an app wants to be defensive, it can escape on its own.

        Michael McCandless added a comment -

        From more than a dozen faceted search applications I've worked with, none had any issue with the separator. Therefore I think it's overly defensive programming for Lucene to do it. If an app wants to be defensive, it can escape on its own.

        OK, fair enough.

        It still seems like bad practice to me but I guess we can keep it / worry about it if a user hits it.

        Michael McCandless added a comment -

        Patch. The sep is now hardwired to FIP's default sep, and I added SortedSetDocValuesFacetField for indexing.

        One thing I don't like is how I compute the DV field as clpParams.field + "sdv" ... I did this so that a single index could easily (w/o having to make a custom CLP) index both kinds of facets ... not sure if there's a better way.

        Shai Erera added a comment -

        I don't think we should do it. First, it's not really supported: whoever uses SSDVF must use the accompanying accumulator, which cannot handle requests against the taxonomy or hierarchical facets. Second, if someone has already chosen to have a sidecar taxonomy, why would they use both methods?

        Again, "by mistake" won't work because of what I wrote above. In order to make it work you need to write a special accumulator, and I think that if you've already reached that level of expertise, you can work with per-dimension params?

        Shai Erera added a comment -

        Hmmm, correction. You cannot write an accumulator which handles both types of facets, since the ordinal spaces are different and it therefore cannot reuse the same FacetArrays. While it could potentially allocate two arrays and delegate to the appropriate accumulator, I think this points at the high risk of indexing facets in two ways.

        If someone wants to do it though (I still don't think it makes sense), he should use PerDimIndexingParams. That's what we have them for.

        Mike, I think that SortedSetDVFacetField should go under o.a.l.facet.sortedset together with all relevant classes. That will make it both easier for people to integrate with it, since by looking at the jdocs you'll see all relevant classes as well as (I hope) reduce the chance for a mistake.

        Michael McCandless added a comment -

        The use case I'm picturing is an app would choose to use the existing
        APIs for the hierarchical fields and the sorted set field/accumulator
        for the flat fields; I think this is a reasonable typical example and
        I think it's important to make that easy. At search time they make
        two FacetCollectors ...

        The new method uses much less RAM for flat fields (doesn't allocate 3
        X int[maxOrd]), which I think for some apps is a good tradeoff.

        Also, I don't like the risk that the app will try to add both "styles" of
        facet field to a Document and expect it to work ... they will get a
        cryptic exception about DocValues type changing for a single field.

        Mike, I think that SortedSetDVFacetField should go under o.a.l.facet.sortedset together with all relevant classes.

        Good idea; I'll do that.

        Shai Erera added a comment -

        At search time they make two FacetCollectors

        They should be making two FacetAccumulators and a delegating one. It's more efficient than having two collectors which track matching documents per-segment.

        The new method uses much less RAM for flat fields

        I scanned the issue but did not find the right numbers. Do you have them? How much more RAM does the taxo consume (with its 3 int[]) on the non-hierarchical dimensions only, vs the sorted set approach?

        Also, I don't like the risk that the app will try to add both "styles" of
        facet field to a Document and expect it to work ... they will get a
        cryptic exception about DocValues type changing for a single field.

        I think that's fine? It's an exception that the app will hit quite fast (hopefully during testing). And if you put the sorted set stuff in its own package, then we should document that this does not work with FacetFields, unless the categories are put in different CLPs (using PerDimIndexingParams).

        IMO, the fact that you add "sdv" to the sorted set field is premature optimization. No one has used the new method yet, let alone tried to use both of them. Why not do it when it will really hit someone?

        Shai Erera added a comment -

        Few more comments about the patch:

        • SortedSetDVFacetField:
          • Should it extend SSDVField? It means you will need to construct the BytesRef before you make all the validity checks, but that's ok?
            • You could also make all those validity checks in a static toBytesRef(CP) method?
          • You should use FIP.getFacetDelimChar instead of the DEFAULT constant, as an app can override that
            • Also in SortedSetDVReaderState, tests (where it makes sense).
            • Maybe add a test which overrides FIP.getFacetDelimChar and make sure SSDFF uses it instead of the constant?
          • Instead of doing cp[0] + delim + cp[1] you can call cp.toString(fip.getFacetDelimChar())?
            • In that regard, if we added BytesRef.append(CharSequence) we could impl CP.toBytesRef(char delim) and save the redundant StringBuilder and String allocations.
        • In TestDemoFacets you do doc.add(new SortedSetDocValuesFacetField(new CategoryPath("b", "baz" + FacetIndexingParams.DEFAULT_FACET_DELIM_CHAR + "foo"))). What's the purpose?
          • If facetDelimChar is '/', this will create a drill-down term b/baz/foo right? What does this mean in terms of the test? Is this counted as baz/foo or baz? (perhaps add a comment in the test what exactly does it test?)
        • In SSDVAccumulator:
          • Should the javadocs say "that uses previously indexed {@link SortedSetDocValuesFacetField}"? Will it really work with any SSDV indexed?
          • In the Aggregator, can you add a meaningful message to UnsupportedOpEx?
          • Regarding the top-K, you don't currently optimize the case where someone asks for "all facets" (i.e. numResults=Integer.MAX_VALUE, or > numOrds). Would you like to handle it in this patch? Look at DepthOneFacetResultsHandler, which adds all values to a list and sorts it, rather than working with the PQ.
            • If you don't want to handle it now (i.e. if you don't think that improves things) then at least construct the PQ with min(numResults, taxo.size).
          • Is this code really preferred over just setting bottomCount to top().value in every iteration? I think that in the common case we should expect more than K facets per dim, and if there are fewer, then I don't think it matters if we do that assignment vs. always calling q.size().
            +          if (q.size() == request.numResults) {
            +            bottomCount = (int) q.top().value;
            +            //System.out.println("    new bottom=" + bottomCount);
            +          }
            
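The bounded-queue sizing suggested above (construct the PQ with min(numResults, number of ordinals)) can be sketched in standalone Java. This uses java.util.PriorityQueue for illustration only; the actual patch uses Lucene's own PriorityQueue, and all names here are hypothetical:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class TopKCounts {
    // Return the ordinals with the highest counts, best first. Sizing the
    // queue with min(numResults, counts.length) avoids allocating a huge heap
    // when a caller asks for "all facets" via numResults = Integer.MAX_VALUE.
    static int[] topK(int[] counts, int numResults) {
        int pqSize = Math.min(numResults, counts.length);
        if (pqSize == 0) {
            return new int[0];
        }
        // Min-heap on count: the smallest of the current top-K sits on top.
        PriorityQueue<int[]> pq =
            new PriorityQueue<>(pqSize, Comparator.comparingInt((int[] e) -> e[1]));
        for (int ord = 0; ord < counts.length; ord++) {
            if (pq.size() < pqSize) {
                pq.add(new int[] {ord, counts[ord]});
            } else if (counts[ord] > pq.peek()[1]) {
                pq.poll(); // evict the current bottom
                pq.add(new int[] {ord, counts[ord]});
            }
        }
        // Drain the min-heap into descending-count order.
        int[] result = new int[pq.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = pq.poll()[0];
        }
        return result;
    }

    public static void main(String[] args) {
        // Ordinal 2 (count 9) and ordinal 3 (count 7) win.
        int[] top2 = topK(new int[] {5, 1, 9, 7, 3}, 2);
        System.out.println(top2[0] + " " + top2[1]); // 2 3
    }
}
```

With the min() applied, asking for Integer.MAX_VALUE results costs only one heap entry per ordinal actually seen, rather than a queue sized to MAX_VALUE.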
        Michael McCandless added a comment -

        OK I folded in all that feedback Shai, thanks!

        I also improved TestDrillSideways to not always ask for all results
        (so we test the topN).

        In that regard, if we added BytesRef.append(CharSequence) we could impl CP.toBytesRef(char delim) and save the redundant StringBuilder and String allocations.

        Hmm, true. Later...

        In TestDemoFacets you do doc.add(new SortedSetDocValuesFacetField(new CategoryPath("b", "baz" + FacetIndexingParams.DEFAULT_FACET_DELIM_CHAR + "foo"))). What's the purpose?

        I want to assert that the label is allowed to use the delimiter;
        only the dimension is not allowed to.
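A minimal sketch of the invariant this test asserts: the delimiter is rejected in the dimension but legal in the label. This is a hypothetical helper, not the patch's actual encoding code:

```java
public class FacetTermEncoding {
    static final char DELIM = '/'; // assumed default facet delimiter

    // Build a drill-down term from (dimension, label). The dimension must not
    // contain the delimiter (the reader could not tell where the dim ends),
    // but the label may contain it: since this faceting is flat (no
    // hierarchy), everything after the first delimiter is one opaque label.
    static String toDrillDownTerm(String dim, String label) {
        if (dim.indexOf(DELIM) >= 0) {
            throw new IllegalArgumentException("dimension must not contain '" + DELIM + "'");
        }
        return dim + DELIM + label;
    }

    public static void main(String[] args) {
        System.out.println(toDrillDownTerm("b", "baz/foo")); // b/baz/foo
    }
}
```

So "b/baz/foo" is counted as the single label "baz/foo" under dimension "b", not as a two-level path.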

        Instead of doing cp[0] + delim + cp[1] you can call cp.toString(fip.getFacetDelimChar())?

        I can't ... because CP.toString gets angry about the delim in the
        label when in fact this is fine.

        In the Aggregator, can u add a meaningful message to UnsupportedOpEx?

        I changed this to a no-op and left a comment.

        Is this code really preferred over just setting bottomCount to top().value in every iteration?

        I think that's wrong, i.e. you can only apply the bottomCount check once
        the queue is full? (Unless I pre-fill with sentinels ... which feels
        like too much optimizing).
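The point can be illustrated with a standalone sketch (java.util.PriorityQueue here for brevity, names illustrative): bottomCount is only meaningful once the queue holds K entries, so the q.size() guard is required unless the queue is pre-filled with sentinels:

```java
import java.util.PriorityQueue;

public class BottomCountGuard {
    // Collect the top-K of counts, updating bottomCount only once the queue
    // actually holds K entries. Before that point bottomCount stays 0, so
    // every positive candidate is admitted; reading top() from a partially
    // filled queue would wrongly reject candidates.
    static int collectTopK(int[] counts, int k) { // assumes k >= 1
        PriorityQueue<Integer> q = new PriorityQueue<>(k); // min-heap
        int bottomCount = 0;
        for (int c : counts) {
            if (c > bottomCount) {
                if (q.size() == k) {
                    q.poll(); // evict the current bottom
                }
                q.add(c);
                if (q.size() == k) {
                    bottomCount = q.peek(); // only valid once the queue is full
                }
            }
        }
        return bottomCount; // the smallest count kept in the top-K
    }

    public static void main(String[] args) {
        System.out.println(collectTopK(new int[] {4, 8, 2, 6}, 2)); // 6
    }
}
```

If bottomCount were set from top() on every iteration, a half-filled queue's top() would reject counts that still belong in the top-K.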

        Shai Erera added a comment -

        I can't ... because CP.toString gets angry about the delim in the label when in fact this is fine.

        It's fine only in this case, since you don't support hierarchy, right? Then perhaps just drop a comment in the ctor?

        I think that's wrong, i.e you can only apply the bottomCount check once
        the queue is full? (Unless I pre-fill with sentinels ... which feels
        like too much optimizing).

        Argh, you're right. I don't think it's too much optimizing but I'm not going to argue about it. Your call.

        I'm +1 to commit!

        Michael McCandless added a comment -

        Then perhaps just drop a comment in the ctor?

        OK I'll put a comment where I append w/ the delim explaining why I can't use CP.toString ...

        Thanks Shai! I'll commit soon...

        Commit Tag Bot added a comment -

        [trunk commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1457092

        LUCENE-4795: add new facet method to facet from SortedSetDocValues without using taxonomy index

        Commit Tag Bot added a comment -

        [branch_4x commit] Michael McCandless
        http://svn.apache.org/viewvc?view=revision&revision=1457095

        LUCENE-4795: add new facet method to facet from SortedSetDocValues without using taxonomy index

        Shai Erera added a comment -

        Mike, I think it would be good if we also had example code under demo/facet that demonstrates how to use the new field at indexing and search time. Something very similar to SimpleFacetExample, only without the Date/hierarchical facet. This can be done in a separate issue if you prefer.

        Michael McCandless added a comment -

        That's a good idea ... I'll open a new issue ... we need to get drill-down working first (LUCENE-4840).

        Uwe Schindler added a comment -

        Closed after release.

        ASF subversion and git services added a comment -

        Commit 1556045 from Michael McCandless in branch 'dev/branches/lucene5376'
        [ https://svn.apache.org/r1556045 ]

        LUCENE-4795, LUCENE-5376: expose SortedSetDocValuesFacets in lucene server


          People

          • Assignee: Michael McCandless
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 3