Lucene - Core
  1. Lucene - Core
  2. LUCENE-2271

Function queries producing scores of -inf or NaN (e.g. 1/x) return incorrect results with TopScoreDocCollector

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.9
    • Fix Version/s: 4.9, 5.0
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is a foolowup to LUCENE-2270, where a part of this problem was fixed (boost = 0 leading to NaN scores, which is also un-intuitive), but in general, function queries in Solr can create these invalid scores easily. In previous version of Lucene these scores ordered correct (except NaN, which mixes up results), but never invalid document ids are returned (like Integer.MAX_VALUE).

      The problem is: TopScoreDocCollector pre-fills the HitQueue with sentinel ScoreDocs with a score of -inf and a doc id of Integer.MAX_VALUE. For the HQ to work, this sentinel must be smaller than all posible values, which is not the case:

      • -inf is equal and the document is not inserted into the HQ, as not competitive, but the HQ is not yet full, so the sentinel values keep in the HQ and result is the Integer.MAX_VALUE docs. This problem is solveable (and only affects the Ordered collector) by chaning the exit condition to:
        if (score <= pqTop.score && pqTop.doc != Integer.MAX_VALUE) {
            // Since docs are returned in-order (i.e., increasing doc Id), a document
            // with equal score to pqTop.score cannot compete since HitQueue favors
            // documents with lower doc Ids. Therefore reject those docs too.
            return;
        }
        
      • The NaN case can be fixed in the same way, but then has another problem: all comparisons with NaN result in false (none of these is true): x < NaN, x > NaN, NaN == NaN. This leads to the fact that HQ's lessThan always returns false, leading to unexspected ordering in the PQ and sometimes the sentinel values do not stay at the top of the queue. A later hit then overrides the top of the queue but leaves the incorrect sentinels unchanged -> invalid results. This can be fixed in two ways in HQ:
        Force all sentinels to the top:
        protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
            if (hitA.doc == Integer.MAX_VALUE)
              return true;
            if (hitB.doc == Integer.MAX_VALUE)
              return false;
            if (hitA.score == hitB.score)
              return hitA.doc > hitB.doc; 
            else
              return hitA.score < hitB.score;
        }
        

        or alternatively have a defined order for NaN (Float.compare sorts them after +inf):

        protected final boolean lessThan(ScoreDoc hitA, ScoreDoc hitB) {
            if (hitA.score == hitB.score)
              return hitA.doc > hitB.doc; 
            else
              return Float.compare(hitA.score, hitB.score) < 0;
        }
        

      The problem with both solutions is, that we have now more comparisons per hit and the use of sentinels is questionable. I would like to remove the sentinels and use the old pre 2.9 code for comparing and using PQ.add() when a competitive hit arrives. The order of NaN would be unspecified.

      To fix the order of NaN, it would be better to replace all score comparisons by Float.compare() [also in FieldComparator].

      I would like to delay 2.9.2 and 3.0.1 until this problem is discussed and solved.

      1. LUCENE-2271.patch
        3 kB
        Michael McCandless
      2. LUCENE-2271.patch
        11 kB
        Uwe Schindler
      3. LUCENE-2271.patch
        10 kB
        Uwe Schindler
      4. LUCENE-2271.patch
        6 kB
        Uwe Schindler
      5. LUCENE-2271.patch
        6 kB
        Uwe Schindler
      6. LUCENE-2271-bench.patch
        27 kB
        Uwe Schindler
      7. LUCENE-2271-maybe-as-separate-collector.patch
        12 kB
        Uwe Schindler
      8. TSDC.patch
        1 kB
        Robert Muir

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          its not a bug, as its doc'ed to work this way.

           * <p><b>NOTE</b>: The values Float.Nan,
           * Float.NEGATIVE_INFINITY and Float.POSITIVE_INFINITY are
           * not valid scores.  This collector will not properly
           * collect hits with such scores.
          
          Show
          Robert Muir added a comment - its not a bug, as its doc'ed to work this way. * <p><b>NOTE</b>: The values Float .Nan, * Float .NEGATIVE_INFINITY and Float .POSITIVE_INFINITY are * not valid scores. This collector will not properly * collect hits with such scores.
          Hide
          Robert Muir added a comment -

          I don't think we should do anything to fix NaN, such as using more expensive comparisons (Float.compareTo) and stuff. as it is not a number, its an invalid score.

          i think function queries shoudl throw and exception instead of producing NaN, this problem is only limited to function queries.

          I think fixing scores of negative infinity make more sense, as these are unpreventable (again only a problem with function queries!) and at least negative infinity is actually a number.

          i think "fixing" a top-N collector, or "fixing" anything that sorts NaN is wrong. NaN doesnt have a properly defined sort order. NaN has an order hacked into Float.compareTo, but this is different. sorting the primitive type makes no sense, and the documentation should stay that it doesnt work with TSDC.

          Show
          Robert Muir added a comment - I don't think we should do anything to fix NaN, such as using more expensive comparisons (Float.compareTo) and stuff. as it is not a number, its an invalid score. i think function queries shoudl throw and exception instead of producing NaN, this problem is only limited to function queries. I think fixing scores of negative infinity make more sense, as these are unpreventable (again only a problem with function queries!) and at least negative infinity is actually a number. i think "fixing" a top-N collector, or "fixing" anything that sorts NaN is wrong. NaN doesnt have a properly defined sort order. NaN has an order hacked into Float.compareTo, but this is different. sorting the primitive type makes no sense, and the documentation should stay that it doesnt work with TSDC.
          Hide
          Uwe Schindler added a comment -

          This is patch that supports NaN and -inf.

          The cost of the additional checks in HitQueue.lessThan are neglectible, as they only occur when a competitive hit is really inserted into the queue. The check enforces all sentinels to the top of the queue, regardless what their score is (because always NaN != NaN).

          Show
          Uwe Schindler added a comment - This is patch that supports NaN and -inf. The cost of the additional checks in HitQueue.lessThan are neglectible, as they only occur when a competitive hit is really inserted into the queue. The check enforces all sentinels to the top of the queue, regardless what their score is (because always NaN != NaN).
          Hide
          Robert Muir added a comment -

          attached is a patch, written by Uwe. as far as a "bugfix" i prefer this patch, as the more complicated, performance-intrusive NaN fixes I think should be something we do in 3.1

          e.g., "fixing" NaN to work will likely slow down people getting large numbers of results, and i don't think we should do that in bugfix releases.

          but in 3.1, we could change it, include some large results-oriented collectors for these people, and the whole thing would make sense.

          Show
          Robert Muir added a comment - attached is a patch, written by Uwe. as far as a "bugfix" i prefer this patch, as the more complicated, performance-intrusive NaN fixes I think should be something we do in 3.1 e.g., "fixing" NaN to work will likely slow down people getting large numbers of results, and i don't think we should do that in bugfix releases. but in 3.1, we could change it, include some large results-oriented collectors for these people, and the whole thing would make sense.
          Hide
          Uwe Schindler added a comment -

          Sorry reverted a comment remove.

          Show
          Uwe Schindler added a comment - Sorry reverted a comment remove.
          Hide
          Robert Muir added a comment -

          The cost of the additional checks in HitQueue.lessThan are neglectible, as they only occur when a competitive hit is really inserted into the queue.

          This should be benchmarked for MultiSearcher and ParallelMultiSearcher, too, as they also use HitQueue.

          Show
          Robert Muir added a comment - The cost of the additional checks in HitQueue.lessThan are neglectible, as they only occur when a competitive hit is really inserted into the queue. This should be benchmarked for MultiSearcher and ParallelMultiSearcher, too, as they also use HitQueue.
          Hide
          Uwe Schindler added a comment -

          Patch with testcases for trunk, but should work on branches, too (after removing @Override). Without the fixes in HitQueue or TSDC the tests fail.

          Show
          Uwe Schindler added a comment - Patch with testcases for trunk, but should work on branches, too (after removing @Override). Without the fixes in HitQueue or TSDC the tests fail.
          Hide
          Yonik Seeley added a comment -

          its not a bug, as its doc'ed to work this way.

          OK, so it was a design bug too.

          Show
          Yonik Seeley added a comment - its not a bug, as its doc'ed to work this way. OK, so it was a design bug too.
          Hide
          Robert Muir added a comment -

          OK, so it was a design bug too.

          A design bug that function queries score docs with an invalid score (NaN) instead of throwing an exception?

          Show
          Robert Muir added a comment - OK, so it was a design bug too. A design bug that function queries score docs with an invalid score (NaN) instead of throwing an exception?
          Hide
          Yonik Seeley added a comment -

          A design bug that function queries score docs with an invalid score (NaN) instead of throwing an exception?

          No, a design bug that -Inf scores were disallowed, esp since they were handled just fine in the past.

          NaN is different - it's more of a judgement call depending on the cost to handle it.

          Show
          Yonik Seeley added a comment - A design bug that function queries score docs with an invalid score (NaN) instead of throwing an exception? No, a design bug that -Inf scores were disallowed, esp since they were handled just fine in the past. NaN is different - it's more of a judgement call depending on the cost to handle it.
          Hide
          Uwe Schindler added a comment -

          The cost to handle NaN is the modified lessThan() in HitQueue.

          Show
          Uwe Schindler added a comment - The cost to handle NaN is the modified lessThan() in HitQueue.
          Hide
          Uwe Schindler added a comment -

          Improved test, that also checks for increasing doc ids when score identical

          Show
          Uwe Schindler added a comment - Improved test, that also checks for increasing doc ids when score identical
          Hide
          Yonik Seeley added a comment -

          I'm starting to think that handling NaN is as important as handling the infinities.
          This is because once you allow infinities, it's easy to get a NaN with a simple multiplication by 0.

          Show
          Yonik Seeley added a comment - I'm starting to think that handling NaN is as important as handling the infinities. This is because once you allow infinities, it's easy to get a NaN with a simple multiplication by 0.
          Hide
          Robert Muir added a comment -

          This is because once you allow infinities, it's easy to get a NaN with a simple multiplication by 0.

          maybe we should keep it as is then, and only allow finite results

          Show
          Robert Muir added a comment - This is because once you allow infinities, it's easy to get a NaN with a simple multiplication by 0. maybe we should keep it as is then, and only allow finite results
          Hide
          Yonik Seeley added a comment -

          maybe we should keep it as is then, and only allow finite results

          But how? Finite results can combine and overflow into infinite results with things like boolean query, so we can't give a definite range of values that a query can safely produce.

          Show
          Yonik Seeley added a comment - maybe we should keep it as is then, and only allow finite results But how? Finite results can combine and overflow into infinite results with things like boolean query, so we can't give a definite range of values that a query can safely produce.
          Hide
          Robert Muir added a comment -

          But how? Finite results can combine and overflow into infinite results with things like boolean query, so we can't give a definite range of values that a query can safely produce.

          right, but this really only happens with function queries. this is my problem, with normal queries how can this happen?
          its a very very very very special case, and it seems overkill to add all these checks to performance-sensitive top-docs collection just for function queries, to handle NaN and infinites

          Show
          Robert Muir added a comment - But how? Finite results can combine and overflow into infinite results with things like boolean query, so we can't give a definite range of values that a query can safely produce. right, but this really only happens with function queries. this is my problem, with normal queries how can this happen? its a very very very very special case, and it seems overkill to add all these checks to performance-sensitive top-docs collection just for function queries, to handle NaN and infinites
          Hide
          Uwe Schindler added a comment -

          In my opinion we should fix it using the attached patch and in the future 3.1 do some refactoring:

          • no sentinels
          • define a order for NaN, as NaN breaks the complete order of results (because PQ cannot handle the case that lessThan(a,b) returns false and also lessThan(b,a) when NaN is involved)
          Show
          Uwe Schindler added a comment - In my opinion we should fix it using the attached patch and in the future 3.1 do some refactoring: no sentinels define a order for NaN, as NaN breaks the complete order of results (because PQ cannot handle the case that lessThan(a,b) returns false and also lessThan(b,a) when NaN is involved)
          Hide
          Robert Muir added a comment -

          In my opinion we should fix it using the attached patch and in the future 3.1 do some refactoring:

          in general agree, on one condition that it doesnt slow down normal queries, MultiSearcher, or ParallelMultiSearcher significantly (benchmarks)

          in my opinion the thing is documented not to work for these values, so its not a bug at all. if we want to support these values we should redesign the collector (in 3.1)

          Show
          Robert Muir added a comment - In my opinion we should fix it using the attached patch and in the future 3.1 do some refactoring: in general agree, on one condition that it doesnt slow down normal queries, MultiSearcher, or ParallelMultiSearcher significantly (benchmarks) in my opinion the thing is documented not to work for these values, so its not a bug at all. if we want to support these values we should redesign the collector (in 3.1)
          Hide
          Uwe Schindler added a comment -

          Here a simplier patch with sentinels removed. You can maybe think about a better if-check in the out of order collector

          Show
          Uwe Schindler added a comment - Here a simplier patch with sentinels removed. You can maybe think about a better if-check in the out of order collector
          Hide
          Uwe Schindler added a comment -

          Sorry, insertWithOverflow is correct!

          Show
          Uwe Schindler added a comment - Sorry, insertWithOverflow is correct!
          Hide
          Marvin Humphrey added a comment -

          An awful lot of thought went into optimizing those collection algorithms. I
          disagree with many of the design decisions that were made, but it seems rushed
          to blithely revert those optimizations.

          FWIW, the SortCollector in KS (designed on the Lucy list last spring, would be
          in Lucy but some prereqs haven't gone in yet) doesn't have the problem with
          -Inf sentinels. It uses an array of "actions" representing sort rules to
          determine whether a hit is "competitive" and should be inserted into the
          queue; the first action is set to AUTO_ACCEPT (meaning try inserting the hit
          into the queue) until the queue fills up, and then again to AUTO_ACCEPT at the
          start of each segment. It's not necessary to fill up the queue with dummy
          hits beforehand.

          static INLINE bool_t
          SI_competitive(SortCollector *self, i32_t doc_id)
          {
              u8_t *const actions = self->actions;
              u32_t i = 0;
          
              /* Iterate through our array of actions, returning as quickly as
               * possible. */
              do {
                  switch (actions[i] & ACTIONS_MASK) {
                      case AUTO_ACCEPT:
                          return true;
                      case AUTO_REJECT:
                          return false;
                      case AUTO_TIE:
                          break;
                      case COMPARE_BY_SCORE: {
                              float score = Matcher_Score(self->matcher);
                              if  (score > self->bubble_score) {
                                  self->bumped->score = score;
                                  return true;
                              }
                              else if (score < self->bubble_score) {
                                  return false;
                              }
                          }
                          break;
                      case COMPARE_BY_SCORE_REV: {
                          // ...
                      case COMPARE_BY_DOC_ID:
                          // ...
                      case COMPARE_BY_ORD1: {
                          // ...
                  }
              } while (++i < self->num_actions);
          
              /* If we've made it this far and we're still tied, reject the doc so that
               * we prefer items already in the queue.  This has the effect of
               * implicitly breaking ties by doc num, since docs are collected in order.
               */
              return false;
          }
          
          Show
          Marvin Humphrey added a comment - An awful lot of thought went into optimizing those collection algorithms. I disagree with many of the design decisions that were made, but it seems rushed to blithely revert those optimizations. FWIW, the SortCollector in KS (designed on the Lucy list last spring, would be in Lucy but some prereqs haven't gone in yet) doesn't have the problem with -Inf sentinels. It uses an array of "actions" representing sort rules to determine whether a hit is "competitive" and should be inserted into the queue; the first action is set to AUTO_ACCEPT (meaning try inserting the hit into the queue) until the queue fills up, and then again to AUTO_ACCEPT at the start of each segment. It's not necessary to fill up the queue with dummy hits beforehand. static INLINE bool_t SI_competitive(SortCollector *self, i32_t doc_id) { u8_t *const actions = self->actions; u32_t i = 0; /* Iterate through our array of actions, returning as quickly as * possible. */ do { switch (actions[i] & ACTIONS_MASK) { case AUTO_ACCEPT: return true; case AUTO_REJECT: return false; case AUTO_TIE: break; case COMPARE_BY_SCORE: { float score = Matcher_Score(self->matcher); if (score > self->bubble_score) { self->bumped->score = score; return true; } else if (score < self->bubble_score) { return false; } } break; case COMPARE_BY_SCORE_REV: { // ... case COMPARE_BY_DOC_ID: // ... case COMPARE_BY_ORD1: { // ... } } while (++i < self->num_actions); /* If we've made it this far and we're still tied, reject the doc so that * we prefer items already in the queue. This has the effect of * implicitly breaking ties by doc num, since docs are collected in order. */ return false; }
          Hide
          Uwe Schindler added a comment -

          Attached is the patch with the testcase from the previous one (with sentinels). All tests pass.

          Show
          Uwe Schindler added a comment - Attached is the patch with the testcase from the previous one (with sentinels). All tests pass.
          Hide
          Michael McCandless added a comment -

          I would rather not change the core default collector here, when
          sorting by score.

          All of the patches being considered would add an extra comparison/and
          (or maybe if) per collect call, which while presumably small, is still
          an added cost.

          Likely this added cost is in the noise of a benchmark so we wouldn't
          be able to conclude much by benching... but enough of these added
          costs will eventually add up. At the end of the day you are adding
          non-zero work for the CPU to do, for every collect.

          This is Lucene's searching hotspot, so we should only be adding extra
          code on this critical path when it's really needed.

          It's ashame to make everyone pay that price when in practice very few
          users need to collect -Inf/NaN scoring docs. This hasn't been hit in
          a "real" use case yet.

          In fact, before 2.9 (ie up to and including 2.4), we by default
          filtered out all docs scoring <= 0.0, so -Inf and NaN were always
          filtered out, anyway.

          Users who somehow do hit this in a real use case have a good
          workaround: use TopFieldCollector, sorting by Relevance. This will
          properly handle -Inf, and will at least collect NaN scoring docs (but
          their sort order will be random, as it always has). Or, use
          PositiveScoresOnlyCollector (to get back to Lucene 2.4 behavior). Or,
          create a custom collector.

          Show
          Michael McCandless added a comment - I would rather not change the core default collector here, when sorting by score. All of the patches being considered would add an extra comparison/and (or maybe if) per collect call, which while presumably small, is still an added cost. Likely this added cost is in the noise of a benchmark so we wouldn't be able to conclude much by benching... but enough of these added costs will eventually add up. At the end of the day you are adding non-zero work for the CPU to do, for every collect. This is Lucene's searching hotspot, so we should only be adding extra code on this critical path when it's really needed. It's ashame to make everyone pay that price when in practice very few users need to collect -Inf/NaN scoring docs. This hasn't been hit in a "real" use case yet. In fact, before 2.9 (ie up to and including 2.4), we by default filtered out all docs scoring <= 0.0, so -Inf and NaN were always filtered out, anyway. Users who somehow do hit this in a real use case have a good workaround: use TopFieldCollector, sorting by Relevance. This will properly handle -Inf, and will at least collect NaN scoring docs (but their sort order will be random, as it always has). Or, use PositiveScoresOnlyCollector (to get back to Lucene 2.4 behavior). Or, create a custom collector.
          Hide
          Uwe Schindler added a comment -

          Here is a new impl that only has exactly one additional check in the initial collection (when th pq is not yet full). After the PQ is full, the collector is replaced by the short-cutting one.

          This impl should even be faster than before, if the additional method call does not count and is removed by the JVM (which it should, because its clearly predictable)

          Show
          Uwe Schindler added a comment - Here is a new impl that only has exactly one additional check in the initial collection (when th pq is not yet full). After the PQ is full, the collector is replaced by the short-cutting one. This impl should even be faster than before, if the additional method call does not count and is removed by the JVM (which it should, because its clearly predictable)
          Hide
          Uwe Schindler added a comment -

          more improved

          Show
          Uwe Schindler added a comment - more improved
          Hide
          Uwe Schindler added a comment -

          More optimized version with more local variables. This is the version for the benchmark-try.

          Show
          Uwe Schindler added a comment - More optimized version with more local variables. This is the version for the benchmark-try.
          Hide
          Uwe Schindler added a comment -

          Here a benchmark task made by grant. Run collector.alg and wait long enough.

          Show
          Uwe Schindler added a comment - Here a benchmark task made by grant. Run collector.alg and wait long enough.
          Hide
          Uwe Schindler added a comment -

          More improved version, now equal to prefilled queue case, as the collector reuses overflowed ScoreDoc instances.

          Show
          Uwe Schindler added a comment - More improved version, now equal to prefilled queue case, as the collector reuses overflowed ScoreDoc instances.
          Hide
          Uwe Schindler added a comment - - edited

          I did some benchmarks (Java 1.5.0_22, 64bit, Win7, Core2Duo P8700) will do more tomorrow when i set up a large testing environment with 3 separate checkouts containing the three collector versions):

          Show
          Uwe Schindler added a comment - - edited I did some benchmarks (Java 1.5.0_22, 64bit, Win7, Core2Duo P8700) will do more tomorrow when i set up a large testing environment with 3 separate checkouts containing the three collector versions): The latest approach ( https://issues.apache.org/jira/secure/attachment/12436458/LUCENE-2271.patch ) with no sentinels using the delegation and exchanging the inner collector was as fast as the original unpatched version The approach with sentinels but fixed HitQueue ordering and extra checks ( https://issues.apache.org/jira/secure/attachment/12436329/LUCENE-2271.patch ), showed (as exspected) a little overhead: The ordered collector was as fast as the unpatched unordered collector (because one check more) - so i would not use this patch
          Hide
          Uwe Schindler added a comment -

          Fix an issue when numDocs==0.

          Show
          Uwe Schindler added a comment - Fix an issue when numDocs==0.
          Hide
          Michael McCandless added a comment -

          This is a delightfully clever solution, delegating to a contained
          collector and then swapping in one collector for the startup (when the
          queue is not yet full) and then a new collector once the queue is
          full.

          It's the closest equivalent we in javaland can reach, to using
          function pointers in C.

          But, I don't think we should commit this – this makes the code more
          complex, and doesn't really gain enough (performance is the same) in
          return. It also relies more heavily on the JVM being able to optimize away
          the method call.

          Yes it handles -inf/nan, but I don't think Lucene's default
          sort-by-relevance collector needs to (prior to 2.9 we silently
          filtered out such hits, as well as all hits with score <= 0.0).

          Show
          Michael McCandless added a comment - This is a delightfully clever solution, delegating to a contained collector and then swapping in one collector for the startup (when the queue is not yet full) and then a new collector once the queue is full. It's the closest equivalent we in javaland can reach, to using function pointers in C. But, I don't think we should commit this – this makes the code more complex, and doesn't really gain enough (performance is the same) in return. It also relies more heavily on the JVM being able to optimize away the method call. Yes it handles -inf/nan, but I don't think Lucene's default sort-by-relevance collector needs to (prior to 2.9 we silently filtered out such hits, as well as all hits with score <= 0.0).
          Hide
          Michael McCandless added a comment -

          I think we should fix a few doc issues, and add asserts (attached patch for trunk).

          Show
          Michael McCandless added a comment - I think we should fix a few doc issues, and add asserts (attached patch for trunk).
          Hide
          Uwe Schindler added a comment -

          After applying Mike's patch (with modified asserts to correctly detect NaN), updated my patch of the delegating and -inf/NaN aware TopScoreDocCollector.

          Maybe we should add it as a separate collector for function queries in 3.1. Maybe with correct NaN ordering?

          Show
          Uwe Schindler added a comment - After applying Mike's patch (with modified asserts to correctly detect NaN), updated my patch of the delegating and -inf/NaN aware TopScoreDocCollector. Maybe we should add it as a separate collector for function queries in 3.1. Maybe with correct NaN ordering?
          Hide
          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Show
          Steve Rowe added a comment - Bulk move 4.4 issues to 4.5 and 5.0
          Hide
          Uwe Schindler added a comment -

          Move issue to Lucene 4.9.

          Show
          Uwe Schindler added a comment - Move issue to Lucene 4.9.

            People

            • Assignee:
              Unassigned
              Reporter:
              Uwe Schindler
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:

                Development