  1. Solr
  2. SOLR-12343

JSON Field Facet refinement can return incorrect counts/stats for sorted buckets

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.5, 8.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      The way JSON Facet's simple refinement "re-sorts" buckets after refinement can cause refined buckets to be "bumped out" of the topN once their refined counts/stats are known, depending on the sort. Unrefined buckets that were originally discounted in phase#2 can then bubble up into the topN and be returned to clients with inaccurate counts/stats.

      The simplest way to demonstrate this bug (in some data sets) is with a sort: 'count asc' facet:

      • Assume shard1 returns termX & termY in phase#1 because they have very low shard1 counts
        • but shard2 does not return them at all, because both terms have very high shard2 counts.
      • Assume termX has a slightly lower shard1 count than termY, such that:
        • termX "makes the cut" for the limit=N topN buckets
        • termY does not make the cut, and is the "N+1" known bucket at the end of phase#1
      • termX then gets included in the phase#2 refinement request against shard2
        • termX now has a much higher known total count than termY
        • the coordinator now sorts termX "worse" in the sorted list of buckets than termY
        • which causes termY to bubble up into the topN
      • termY is ultimately included in the final result, instead of termX, with incomplete count/stat/sub-facet data
        • this is all independent of the possibility that termY may actually have a significantly higher total count than termX across the entire collection
        • the key problem is that all/most of the other terms returned to the client have counts/stats that are the accumulation of all shards, but termY only has the contributions from shard1
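
      The scenario above can be reproduced outside of Solr with a small simulation. The sketch below is not Solr code: the shard contents, the term names termA/termB/termC, the constants LIMIT and SHARD_LIMIT, and the helper functions are all hypothetical, and the merge logic is a simplified stand-in for what the coordinator does. It only illustrates how re-sorting after refinement lets unrefined buckets into the topN.

```python
# Simplified simulation of two-phase facet refinement with sort 'count asc'.
# All data and names here are hypothetical; this is not Solr's implementation.

LIMIT = 2        # topN requested by the client
SHARD_LIMIT = 3  # per-shard limit including overrequest (assumed value)

shards = {
    "shard1": {"termX": 2, "termY": 3, "termA": 4, "termB": 50, "termC": 60},
    "shard2": {"termB": 1, "termA": 2, "termC": 4, "termX": 100, "termY": 200},
}


def count_asc(counts):
    """Sort (term, count) pairs by ascending count, then by term name."""
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))


def phase1(shards):
    """Each shard returns its best SHARD_LIMIT buckets; the coordinator merges them."""
    merged, seen = {}, {}
    for name, counts in shards.items():
        for term, n in count_asc(counts)[:SHARD_LIMIT]:
            merged[term] = merged.get(term, 0) + n
            seen.setdefault(term, set()).add(name)
    return merged, seen


def refine(shards, merged, seen):
    """Phase 2: ask the shards that did not report a topN candidate for its count."""
    refined = dict(merged)
    for term, _ in count_asc(merged)[:LIMIT]:
        for name, counts in shards.items():
            if name not in seen[term]:
                refined[term] += counts.get(term, 0)
    return refined


merged, seen = phase1(shards)
candidates = [term for term, _ in count_asc(merged)[:LIMIT]]
refined = refine(shards, merged, seen)

# Naive re-sort of every known bucket after refinement (the problematic step).
naive_top = count_asc(refined)[:LIMIT]

# Ground truth: totals across all shards.
true_totals = {}
for counts in shards.values():
    for term, n in counts.items():
        true_totals[term] = true_totals.get(term, 0) + n

print("refined candidates:", candidates)
print("returned after re-sort:", naive_top)
print("true totals of returned terms:",
      {term: true_totals[term] for term, _ in naive_top})
```

      In this data set termX is refined and its count jumps from 2 to 102, so the re-sort pushes it out of the topN; termY is returned in its place with only its shard1 count, even though its collection-wide total is far higher.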

      Important Notes:

      • This scenario can happen regardless of the amount of overrequest used. Additional overrequest just increases the number of "extra" terms needed in the index with "better" sort values than termX & termY in shard2
      • sort: 'count asc' is not just an exceptional/pathological case:
        • any function sort where additional data provided by shards during refinement can cause a bucket to "sort worse" can also cause this problem.
        • Examples: sum(price_i) asc , min(price_i) desc , avg(price_i) asc|desc , etc...
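
      As a rough illustration of why those function sorts are affected, the sketch below recomputes a stat before and after a second shard contributes its values and checks whether the bucket would "sort worse" under the given direction. The price values, the `sorts_worse` helper, and the `cases` table are hypothetical and only model the arithmetic, not Solr's merge code.

```python
# Which sorts can get "worse" once more shard data is merged in?
# Hypothetical values; this models only the arithmetic of merging stats.

stats = {
    "sum": sum,
    "min": min,
    "avg": lambda values: sum(values) / len(values),
}


def sorts_worse(before, after, direction):
    """For 'asc' a larger value sorts later; for 'desc' a smaller value sorts later."""
    return after > before if direction == "asc" else after < before


# (stat, sort direction, shard1 values, values added by shard2 during refinement)
cases = [
    ("sum", "asc", [10, 20], [5, 40]),
    ("min", "desc", [10, 20], [5, 40]),
    ("avg", "asc", [10, 20], [5, 40]),
    ("avg", "desc", [10, 20], [1]),
    ("sum", "desc", [10, 20], [5, 40]),  # contrast: more data only helps here
]

results = {}
for stat, direction, shard1_values, shard2_values in cases:
    before = stats[stat](shard1_values)
    after = stats[stat](shard1_values + shard2_values)
    results[(stat, direction)] = sorts_worse(before, after, direction)
    print(f"{stat}(price_i) {direction}: {before} -> {after}, "
          f"sorts worse: {results[(stat, direction)]}")
```

      With non-negative values, a sum can only grow and a min can only shrink as shards contribute, so `sum asc` and `min desc` can only move a refined bucket toward the end of the list, while an average can move in either direction.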

        Attachments

        1. SOLR-12343.patch
          5 kB
          Hoss Man
        2. SOLR-12343.patch
          19 kB
          Hoss Man
        3. SOLR-12343.patch
          30 kB
          Hoss Man
        4. SOLR-12343.patch
          30 kB
          Yonik Seeley
        5. SOLR-12343.patch
          33 kB
          Hoss Man
        6. SOLR-12343.patch
          37 kB
          Hoss Man
        7. SOLR-12343.patch
          47 kB
          Hoss Man
        8. SOLR-12343.patch
          49 kB
          Hoss Man
        9. SOLR-12343.patch
          50 kB
          Hoss Man
        10. __incomplete_processEmpty_microfix.patch
          14 kB
          Hoss Man

              People

              • Assignee: Yonik Seeley (yseeley@gmail.com)
              • Reporter: Hoss Man (hossman)