Solr
  1. Solr
  2. SOLR-2403

Problem with facet.sort=lex, shards, and facet.mincount

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 3.2
    • Component/s: search
    • Labels:
      None
    • Environment:

      RHEL5, Ubuntu 10.04

      Description

      I tested this on a recent trunk snapshot (2/25), haven't verified with 3.1 or 1.4.1. I can if necessary and update.

      Solr is not returning the proper number of facet values when sorting alphabetically, using distributed search, and using a facet.mincount that excludes some of the values in the first facet.limit values.

      Easiest explained by example. Sorting alphabetically, the first 20 values for my "subject_facet" field have few documents. 19 facet values have only 1 document associated, and 1 has 2 documents. There are plenty after that have more than 2.

      http://localhost:8082/solr/select?q=*:*&facet=true&facet.field=subject_facet&facet.limit=20&facet.sort=lex&facet.mincount=2
      

      comes back with the expected 20 facet values with >= 2 documents associated.

      If I add a shards parameter that points back to itself, the result is different.

      http://localhost:8082/solr/select?q=*:*&facet=true&facet.field=subject_facet&facet.limit=20&facet.sort=lex&facet.mincount=2&shards=localhost:8082/solr
      

      comes back with only 1 facet value: the single value in the first 20 that had more than 1 document.

      It appears to me that mincount is ignored when doing the original query to the shards, then applied afterwards.

      Let me know if you need any more info.

      Thanks,
      Peter

        Activity

        Hide
        Yonik Seeley added a comment -

        Hmmm, it appears that facet.sort=lex with a mincount above 1 as some of the same issues as sorting by count (i.e. if you want the top 10, it's not possible to predict how many you should request from each shard). Seems like we should mitigate this the same way - by over-requesting.

        Show
        Yonik Seeley added a comment - Hmmm, it appears that facet.sort=lex with a mincount above 1 as some of the same issues as sorting by count (i.e. if you want the top 10, it's not possible to predict how many you should request from each shard). Seems like we should mitigate this the same way - by over-requesting.
        Hide
        Peter Cline added a comment -

        Just to note, the reason this came up for me wasn't a mincount > 1, though that seemed the easiest way to show it. A mincount = 1 with a query that doesn't return many results demonstrates this same problem (since lots of the first alpha-sorted values have count = 0).

        Show
        Peter Cline added a comment - Just to note, the reason this came up for me wasn't a mincount > 1, though that seemed the easiest way to show it. A mincount = 1 with a query that doesn't return many results demonstrates this same problem (since lots of the first alpha-sorted values have count = 0).
        Hide
        Yonik Seeley added a comment -

        Hmmm, yeah. Seems like we should use a mincount of 1 when making the shard requests.
        For higher mincounts we can divide by the number of shards for the shard request mincount.
        For example, if mincount=10, and we are sending to two shards, we know that at least one of the shards must have a count of 5 to make the final mincount cut.

        Show
        Yonik Seeley added a comment - Hmmm, yeah. Seems like we should use a mincount of 1 when making the shard requests. For higher mincounts we can divide by the number of shards for the shard request mincount. For example, if mincount=10, and we are sending to two shards, we know that at least one of the shards must have a count of 5 to make the final mincount cut.
        Hide
        Toke Eskildsen added a comment -

        Dividing by shard count is fairly risky. An example could be the shards

        Shard 1: A(9) B(6) C(10) D(8)
        Shard 2: A(4) B(5) C(4) D(3)
        

        where the request of the top-3 elements with mincount=5 from each shard would give the merged result

        B(11) C(10)
        

        where the correct result would be

        A(13) B(11) C(14) D(11)
        

        The problem with using mincount=1 for each shard-call is of course that the single shard result sets needs to be humongous in order to ensure that the correct values are returned, when the field contains many value with low count and few values with high count. With shards like

        Shard 1: A(1) B(1) C(1) D(1) E(1) F(9) G(1) H(1)
        Shard 2: A(1) B(1) C(1) D(1) E(1) F(1) G(1) H(10)
        

        and a request for mincount=10, all terms must be returned from both shards in order to get the result

        F(10) H(11)
        

        As you, Yonik, point out, a variant of the problem exists when sorting on count. However, for count it is mitigated by the fact that the results from the individual shards are sorted by the selecting key (count). This means that the chance of missing or miscounting tags is low and can be lowered further by relatively little over-requesting.

        With lexical sorting, the selecting key (count again) is independent of the sorting key. Over-requesting helps, but only linear to the fraction of the full result-set from each shard that is requested. Furthermore, the need for over-requesting grows with the number of shards as the overlapping hills can be smaller while still summing up to mincount.

        I do not have any real solution for the problem. One minor improvement would be a collector that kept collecting terms with a mincount=y until limit=n or the number of collected terms with mincount=x was equal to m, where x is the original mincount and y is dependent on the number of shards. This would at least stop the collection process when the result set was guaranteed to contain enough values above the given threshold. It would work well with spikes but poorly with hills just below mincount x and it would still not guarantee correctness of the sums of the counts, only correctness of the terms.

        Show
        Toke Eskildsen added a comment - Dividing by shard count is fairly risky. An example could be the shards Shard 1: A(9) B(6) C(10) D(8) Shard 2: A(4) B(5) C(4) D(3) where the request of the top-3 elements with mincount=5 from each shard would give the merged result B(11) C(10) where the correct result would be A(13) B(11) C(14) D(11) The problem with using mincount=1 for each shard-call is of course that the single shard result sets needs to be humongous in order to ensure that the correct values are returned, when the field contains many value with low count and few values with high count. With shards like Shard 1: A(1) B(1) C(1) D(1) E(1) F(9) G(1) H(1) Shard 2: A(1) B(1) C(1) D(1) E(1) F(1) G(1) H(10) and a request for mincount=10, all terms must be returned from both shards in order to get the result F(10) H(11) As you, Yonik, point out, a variant of the problem exists when sorting on count. However, for count it is mitigated by the fact that the results from the individual shards are sorted by the selecting key (count). This means that the chance of missing or miscounting tags is low and can be lowered further by relatively little over-requesting. With lexical sorting, the selecting key (count again) is independent of the sorting key. Over-requesting helps, but only linear to the fraction of the full result-set from each shard that is requested. Furthermore, the need for over-requesting grows with the number of shards as the overlapping hills can be smaller while still summing up to mincount. I do not have any real solution for the problem. One minor improvement would be a collector that kept collecting terms with a mincount=y until limit=n or the number of collected terms with mincount=x was equal to m, where x is the original mincount and y is dependent on the number of shards. This would at least stop the collection process when the result set was guaranteed to contain enough values above the given threshold. It would work well with spikes but poorly with hills just below mincount x and it would still not guarantee correctness of the sums of the counts, only correctness of the terms.
        Hide
        Yonik Seeley added a comment -

        Dividing by shard count is fairly risky.

        Actually, it seems like it should help? (when mincount is relatively high at least).

        Let's take your example of facet.mincount=10, facet.limit=2, facet.sort=index

        Shard 1: A(1) B(1) C(1) D(1) E(1) F(9) G(1) H(1)
        Shard 2: A(1) B(1) C(1) D(1) E(1) F(1) G(1) H(10)
        

        mincount / nShards = 5, so the shard requests sent will be along the lines of
        facet.mincount=5, facet.limit=5, facet.sort=index (some over-requesting)
        and we will get back
        F(9), H(10)

        The second phase (facet refinement... to true up counts) will retrieve counts from each shard for constraints in the list that it didn't return the first time.
        So shard1 will be asked about H, and shard2 will be asked about F.

        The final response will be F(10),H(11)

        Over-requesting helps, but only linear to the fraction of the full result-set from each shard that is requested.

        Yes, I think you're correct that over-requesting is less useful for sort=index than sort=count.
        Luckily, we can fix the mincount=1 problem and get exact answers for that case, which is the most important case. I think mincount > 1 is relatively rare.

        Show
        Yonik Seeley added a comment - Dividing by shard count is fairly risky. Actually, it seems like it should help? (when mincount is relatively high at least). Let's take your example of facet.mincount=10, facet.limit=2, facet.sort=index Shard 1: A(1) B(1) C(1) D(1) E(1) F(9) G(1) H(1) Shard 2: A(1) B(1) C(1) D(1) E(1) F(1) G(1) H(10) mincount / nShards = 5, so the shard requests sent will be along the lines of facet.mincount=5, facet.limit=5, facet.sort=index (some over-requesting) and we will get back F(9), H(10) The second phase (facet refinement... to true up counts) will retrieve counts from each shard for constraints in the list that it didn't return the first time. So shard1 will be asked about H, and shard2 will be asked about F. The final response will be F(10),H(11) Over-requesting helps, but only linear to the fraction of the full result-set from each shard that is requested. Yes, I think you're correct that over-requesting is less useful for sort=index than sort=count. Luckily, we can fix the mincount=1 problem and get exact answers for that case, which is the most important case. I think mincount > 1 is relatively rare.
        Hide
        Toke Eskildsen added a comment -

        My first example was hills, while the second was spikes, where I agree that the divide-mincount-by-shard# or something similar works well. As it comes down to distribution of counts vs. mincount, we seem to be left with the unsatisfying "it depends, but avoid using mincounts around the average count"-answer.

        I forgot about the refinement phase. That would ensure that my suggestion of a collector with two separate mincounts would return the correct result for counts as well as terms, as long as it did not exceeded the given limits. Alas, it still only helps somewhat and might not be worth the hassle.

        Show
        Toke Eskildsen added a comment - My first example was hills, while the second was spikes, where I agree that the divide-mincount-by-shard# or something similar works well. As it comes down to distribution of counts vs. mincount, we seem to be left with the unsatisfying "it depends, but avoid using mincounts around the average count"-answer. I forgot about the refinement phase. That would ensure that my suggestion of a collector with two separate mincounts would return the correct result for counts as well as terms, as long as it did not exceeded the given limits. Alas, it still only helps somewhat and might not be worth the hassle.
        Hide
        Dmitry Kan added a comment -

        Peter: In one of the distributed faceting sessions we have found out, that the zero facets can be filtered by (undocumented?) facet.zeros parameter. Does anything change, if you set it to 0 (filtering out zero-facets)?

        Show
        Dmitry Kan added a comment - Peter: In one of the distributed faceting sessions we have found out, that the zero facets can be filtered by (undocumented?) facet.zeros parameter. Does anything change, if you set it to 0 (filtering out zero-facets)?
        Hide
        Uwe Schindler added a comment -

        Bulk cose after release of 3.2

        Show
        Uwe Schindler added a comment - Bulk cose after release of 3.2

          People

          • Assignee:
            Unassigned
            Reporter:
            Peter Cline
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development