Lucene - Core
  1. Lucene - Core
  2. LUCENE-3390

Incorrect sort by Numeric values for documents missing the sorting field

    Details

    • Lucene Fields:
      Patch Available

      Description

      While sorting results over a numeric field, documents which do not contain a value for the sorting field seem to get 0 (ZERO) value in the sort. (Tested against Double, Float, Int & Long numeric fields ascending and descending order).
      This behavior is unexpected, as zero is "comparable" to the rest of the values. A better solution would either be allowing the user to define such a "non-value" default, or always bring those document results as the last ones.

      Example scenario:
      Adding 3 documents, 1st with value 3.5d, 2nd with -10d, and 3rd without any value.
      Searching with MatchAllDocsQuery, with sort over that field in descending order yields the docid results of 0, 2, 1.

      Asking for the top 2 documents brings the document without any value as the 2nd result - which seems as a bug?

      1. LUCENE-3390.patch
        23 kB
        Doron Cohen
      2. LUCENE-3390-BitsInterface.patch
        29 kB
        Uwe Schindler
      3. LUCENE-3390-BitsInterface.patch
        26 kB
        Doron Cohen
      4. LUCENE-3390-BitsInterface.patch
        24 kB
        Uwe Schindler
      5. LUCENE-3390-fix-like-trunk.patch
        22 kB
        Uwe Schindler
      6. LUCENE-3390-fix-like-trunk.patch
        19 kB
        Uwe Schindler
      7. LUCENE-3390-fix-like-trunk.patch
        19 kB
        Uwe Schindler
      8. LUCENE-3390-fix-like-trunk.patch
        13 kB
        Uwe Schindler
      9. LUCENE-3390-inverted.patch
        12 kB
        Uwe Schindler
      10. SortByDouble.java
        2 kB
        Gilad Barkai

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Bulk close after release of 3.5

          Show
          Uwe Schindler added a comment - Bulk close after release of 3.5
          Hide
          Uwe Schindler added a comment -

          Committed 3.x branch revision: 1173745

          Show
          Uwe Schindler added a comment - Committed 3.x branch revision: 1173745
          Hide
          Michael McCandless added a comment -

          Looks great!

          Show
          Michael McCandless added a comment - Looks great!
          Hide
          Uwe Schindler added a comment -

          Patch with the BitSet inverted. We break backwards compatibility so this is not an issue at all.

          Show
          Uwe Schindler added a comment - Patch with the BitSet inverted. We break backwards compatibility so this is not an issue at all.
          Hide
          Uwe Schindler added a comment -

          When discussing about the forward port with Mike McCandless on IRC, we thought the double inversion is useless (it was in Doron's patch, because he wanted to use DocIdSetIterator effectively).

          We changed the name to FieldCache.getDocsWithField().

          Patch is easy.

          Show
          Uwe Schindler added a comment - When discussing about the forward port with Mike McCandless on IRC, we thought the double inversion is useless (it was in Doron's patch, because he wanted to use DocIdSetIterator effectively). We changed the name to FieldCache.getDocsWithField(). Patch is easy.
          Hide
          Uwe Schindler added a comment -

          Committed 3.x branch revision: 1173701

          Show
          Uwe Schindler added a comment - Committed 3.x branch revision: 1173701
          Hide
          Michael McCandless added a comment -

          I think that's ready to commit.

          +1, looks great! Thanks Uwe.

          Show
          Michael McCandless added a comment - I think that's ready to commit. +1, looks great! Thanks Uwe.
          Hide
          Uwe Schindler added a comment -

          I added a further test in TestFieldCache to check the Bits returned.

          I think that's ready to commit.

          Show
          Uwe Schindler added a comment - I added a further test in TestFieldCache to check the Bits returned. I think that's ready to commit.
          Hide
          Doron Cohen added a comment -

          Attached patch with a test that fails before this fix (otherwise patch same as previous).

          The test uses 4 collectors simultaneously, each with different missing values.

          Show
          Doron Cohen added a comment - Attached patch with a test that fails before this fix (otherwise patch same as previous). The test uses 4 collectors simultaneously, each with different missing values.
          Hide
          Doron Cohen added a comment -

          I wrote a small test that should fail with the bug Uwe fixed here and pass with the fix. For some reason it is still failing even with that fix. Tried this with previous patch, will now try with last one, though I think it it should pass also with previous one. I'll give it another try.

          Show
          Doron Cohen added a comment - I wrote a small test that should fail with the bug Uwe fixed here and pass with the fix. For some reason it is still failing even with that fix. Tried this with previous patch, will now try with last one, though I think it it should pass also with previous one. I'll give it another try.
          Hide
          Michael McCandless added a comment -

          Looks great Uwe! I think we can assert that the cardinality is <= numDocs, and then short-circuit the common == numDocs (all docs have values) case like you are.

          I love how 3.x handles the unvalued bits... I think we should port this forward to trunk, but maybe make it possible to set the bits as we build up the values (single pass) if you specify up front you want the bit set. I'll open a new issue for this...

          Show
          Michael McCandless added a comment - Looks great Uwe! I think we can assert that the cardinality is <= numDocs, and then short-circuit the common == numDocs (all docs have values) case like you are. I love how 3.x handles the unvalued bits... I think we should port this forward to trunk, but maybe make it possible to set the bits as we build up the values (single pass) if you specify up front you want the bit set. I'll open a new issue for this...
          Hide
          Uwe Schindler added a comment -

          Here a patch with a more clean API (as noted by Mike McCandless):

          • backported the Bits interface from Lucene trunk (do a svn cp http://svn.apache.org/..../trunk/..../Bits.java before applying the patch
          • Added interface to the well-known impls in util package
          • FieldCache.getUnValuesDocs returns Bits now which makes the API very clean

          This breaks backwards a bit more, as Bits does not extend DocIdSet, so code using the new FieldCache method will break, before recompilation was enough (as FixedBitSet extends DocIdSet).

          Mike: How about this?

          Show
          Uwe Schindler added a comment - Here a patch with a more clean API (as noted by Mike McCandless): backported the Bits interface from Lucene trunk (do a svn cp http://svn.apache.org/..../trunk/..../Bits.java before applying the patch Added interface to the well-known impls in util package FieldCache.getUnValuesDocs returns Bits now which makes the API very clean This breaks backwards a bit more, as Bits does not extend DocIdSet, so code using the new FieldCache method will break, before recompilation was enough (as FixedBitSet extends DocIdSet). Mike: How about this?
          Hide
          Uwe Schindler added a comment -

          In my opinion a much more clean and simple approach for FieldComaparator and all other stuff would be the following, as it removes all additional branches from FieldComaparator and makes the code as simple as it was before missingValues at all (also in trunk):

          Thinking more about it: Another apporoach (also possible for trunk) is to supply the missing value to FieldCache.getXxx(). The FieldCache would the first use Arrays.fill() to populate the FieldCache array with the default value and after that populate the index values. The drawback is that you get a separate FieldCache entry for each distinct missing value. For the above se case, you would have two float/double price caches.

          We just have to think about additional memory requirements (which would affect only users actually using different missingValues for several searches). From my perspective this is much cleaner, as you can pass in a missingValue directly when populating the FieldCache. FieldComaparator would simply call FieldCache.DEFAULT.getInts(reader, parser, defaultValue). The cache would use the triplet including defaultValue as key. The sorting code would not need to be changed at all (this is similar to Doron's idea, but moved to FieldCache and not FC.setNextReader).

          We should think about this in an additional issue and for now only fix the broken implementation in 3.x.

          Show
          Uwe Schindler added a comment - In my opinion a much more clean and simple approach for FieldComaparator and all other stuff would be the following, as it removes all additional branches from FieldComaparator and makes the code as simple as it was before missingValues at all (also in trunk): Thinking more about it: Another apporoach (also possible for trunk) is to supply the missing value to FieldCache.getXxx(). The FieldCache would the first use Arrays.fill() to populate the FieldCache array with the default value and after that populate the index values. The drawback is that you get a separate FieldCache entry for each distinct missing value. For the above se case, you would have two float/double price caches. We just have to think about additional memory requirements (which would affect only users actually using different missingValues for several searches). From my perspective this is much cleaner, as you can pass in a missingValue directly when populating the FieldCache. FieldComaparator would simply call FieldCache.DEFAULT.getInts(reader, parser, defaultValue). The cache would use the triplet including defaultValue as key. The sorting code would not need to be changed at all (this is similar to Doron's idea, but moved to FieldCache and not FC.setNextReader). We should think about this in an additional issue and for now only fix the broken implementation in 3.x.
          Hide
          Uwe Schindler added a comment -

          Why did we need to narrow the return value from FC.getUnvaluedDocs to FixedBitSet?

          We have no Bits interface in 3.x. And DocIdSet is not random access. Maybe we should backport the Bits interface?

          Show
          Uwe Schindler added a comment - Why did we need to narrow the return value from FC.getUnvaluedDocs to FixedBitSet? We have no Bits interface in 3.x. And DocIdSet is not random access. Maybe we should backport the Bits interface?
          Hide
          Michael McCandless added a comment -

          I would love to take this even further, and have trunk's FC implement missing values the same way 3.x does (ie, separate FC method to getUnvaluedDocs, rather than bundling this bitset w/ the computation of the values array). But we should do that separately.

          This is actually a serious bug; maybe we should release 3.4.1 soon (this would also fix the Maven packaging problem in 3.4.0).

          Why did we need to narrow the return value from FC.getUnvaluedDocs to FixedBitSet?

          Show
          Michael McCandless added a comment - I would love to take this even further, and have trunk's FC implement missing values the same way 3.x does (ie, separate FC method to getUnvaluedDocs, rather than bundling this bitset w/ the computation of the values array). But we should do that separately. This is actually a serious bug; maybe we should release 3.4.1 soon (this would also fix the Maven packaging problem in 3.4.0). Why did we need to narrow the return value from FC.getUnvaluedDocs to FixedBitSet?
          Hide
          Uwe Schindler added a comment -

          Final patch:

          • Added changes for backwards breaks
          • Removed the bogus docFreq check
          • Optimized the case of empty unvalued docs bit set (like in trunk)

          This patch is now 100% in line with trunk. The code was already tested in trunk and does not affect sort speed for the common case without missing value, as the compiler will ignore the additional null check.

          Will commit later this day.

          Show
          Uwe Schindler added a comment - Final patch: Added changes for backwards breaks Removed the bogus docFreq check Optimized the case of empty unvalued docs bit set (like in trunk) This patch is now 100% in line with trunk. The code was already tested in trunk and does not affect sort speed for the common case without missing value, as the compiler will ignore the additional null check. Will commit later this day.
          Hide
          Uwe Schindler added a comment -

          Doron: That's exactly the problem. This easy use case is problematic:

          You allow sorting by Price. The user can switch between forward and backward sorting. In all cases, you want all articles without a price at the beginning. To achieve this, you have to set the price value e.g. to negative_infinity for the forward sorting, but positive_infinity for backwards sorting. If now two users are using your user interface in parallel, they collide.

          The fix used here is identical to Lucene trunk and we should keep the code similar. FieldComparator is now almost identical between trunk and 3.x (except the new BytesRef/Docvalues stuff in trunk).

          Thinking more about it: Another apporoach (also possible for trunk) is to supply the missing value to FieldCache.getXxx(). The FieldCache would the first use Arrays.fill() to populate the FieldCache array with the default value and after that populate the index values. The drawback is that you get a separate FieldCache entry for each distinct missing value. For the above se case, you would have two float/double price caches.

          Show
          Uwe Schindler added a comment - Doron: That's exactly the problem. This easy use case is problematic: You allow sorting by Price. The user can switch between forward and backward sorting. In all cases, you want all articles without a price at the beginning. To achieve this, you have to set the price value e.g. to negative_infinity for the forward sorting, but positive_infinity for backwards sorting. If now two users are using your user interface in parallel, they collide. The fix used here is identical to Lucene trunk and we should keep the code similar. FieldComparator is now almost identical between trunk and 3.x (except the new BytesRef/Docvalues stuff in trunk). Thinking more about it: Another apporoach (also possible for trunk) is to supply the missing value to FieldCache.getXxx(). The FieldCache would the first use Arrays.fill() to populate the FieldCache array with the default value and after that populate the index values. The drawback is that you get a separate FieldCache entry for each distinct missing value. For the above se case, you would have two float/double price caches.
          Hide
          Doron Cohen added a comment -

          Hi Uwe, thanks for catching this.
          I agree that this is a bug, and needs to be fixed.
          Just to make sure that we agree on what the problem is, let me describe it again: in current 3x code in setNextReader() we extract the values from the cache, e.g. by

          FieldCache.DEFAULT.getDoubles(reader, field, parser);

          and, if a missing value was set, we iterate the unvalued docs and set them to that missing value. However this settings takes place at the same array just obtained from the cache, and so this is (1) inefficient as it will happen again in the next sort with same field, (2) incorrect as if two sorts of same field have different missing value they will collide, and (3) unsafe as you indicated.
          I was very happy with the reuse of the cache for caching the missing values so I would like to try to solve this with that "frame"... More later...

          Show
          Doron Cohen added a comment - Hi Uwe, thanks for catching this. I agree that this is a bug, and needs to be fixed. Just to make sure that we agree on what the problem is, let me describe it again: in current 3x code in setNextReader() we extract the values from the cache, e.g. by FieldCache.DEFAULT.getDoubles(reader, field, parser); and, if a missing value was set, we iterate the unvalued docs and set them to that missing value. However this settings takes place at the same array just obtained from the cache, and so this is (1) inefficient as it will happen again in the next sort with same field, (2) incorrect as if two sorts of same field have different missing value they will collide, and (3) unsafe as you indicated. I was very happy with the reuse of the cache for caching the missing values so I would like to try to solve this with that "frame"... More later...
          Hide
          Uwe Schindler added a comment -

          I am currently looking into a possibility to remove the additional uninversion, if the FieldCache already uninverted the field. The bitset should be populated together with the values, but still stored as a separate cache entry.

          After reading the comments in LUCENE-2649 I don't want to discuss this again. In my opinion, it does not matter to always produce the bitset, even if not needed. But as there are people who care about maxDoc/8 additional bytes in FieldCache seem to fight against this idea, so I would leave this open.

          I would like to fix the issue, any comments on the new and trunk-like FieldComparator implementation/class structure?

          Show
          Uwe Schindler added a comment - I am currently looking into a possibility to remove the additional uninversion, if the FieldCache already uninverted the field. The bitset should be populated together with the values, but still stored as a separate cache entry. After reading the comments in LUCENE-2649 I don't want to discuss this again. In my opinion, it does not matter to always produce the bitset, even if not needed. But as there are people who care about maxDoc/8 additional bytes in FieldCache seem to fight against this idea, so I would leave this open. I would like to fix the issue, any comments on the new and trunk-like FieldComparator implementation/class structure?
          Hide
          Uwe Schindler added a comment -

          Small change:

          • added a TODO: In my opinion the mentioned check is bogus and can never work. IR.docFreq(new Term(field)) will never return the sum of all docFreqs from a field.
          • Moved the empty bitmask check down and use cardinality().

          I am currently looking into a possibility to remove the additional uninversion, if the FieldCache already uninverted the field. The bitset should be populated together with the values, but still stored as a separate cache entry.

          Show
          Uwe Schindler added a comment - Small change: added a TODO: In my opinion the mentioned check is bogus and can never work. IR.docFreq(new Term(field)) will never return the sum of all docFreqs from a field. Moved the empty bitmask check down and use cardinality(). I am currently looking into a possibility to remove the additional uninversion, if the FieldCache already uninverted the field. The bitset should be populated together with the values, but still stored as a separate cache entry.
          Hide
          Uwe Schindler added a comment -

          Here my final fix. The code in the comparator is exactly like trunk, the NumericComparator superclass was added and the API is now similar to trunk. Also fixed the remaining TODO.

          With this code we break the already broken 3.4 code, but make it similar to trunk, so migration is easier again. It is unlikely that somebody is affected by this, in most cases a simple recompilation fixes the problem.

          Show
          Uwe Schindler added a comment - Here my final fix. The code in the comparator is exactly like trunk, the NumericComparator superclass was added and the API is now similar to trunk. Also fixed the remaining TODO. With this code we break the already broken 3.4 code, but make it similar to trunk, so migration is easier again. It is unlikely that somebody is affected by this, in most cases a simple recompilation fixes the problem.
          Hide
          Uwe Schindler added a comment -

          Here a quick fix:

          • Uses same code like trunk inside the comparators
          • Uses FixedBitSet instead of OpenBitSet

          Unfortunately this breaks backwards compatibility as explained before.

          This patch needs some cleanup (remove code duplication and maybe move the missingValue code to a numeric base class like in trunk). This is also broken in current code, it has missing value code in FieldComparator base class, which should only be in a NumericFieldComparator abstract base class (missing in 3.x).

          Show
          Uwe Schindler added a comment - Here a quick fix: Uses same code like trunk inside the comparators Uses FixedBitSet instead of OpenBitSet Unfortunately this breaks backwards compatibility as explained before. This patch needs some cleanup (remove code duplication and maybe move the missingValue code to a numeric base class like in trunk). This is also broken in current code, it has missing value code in FieldComparator base class, which should only be in a NumericFieldComparator abstract base class (missing in 3.x).
          Hide
          Uwe Schindler added a comment - - edited

          The problem is that it's not easy to fix, because the unvalued values BitSet is exposed as DocIdSet not Bits like in trunk. The only quick fix is to clone the cached values before modifying in setNextReader and leave code unchanged.

          Correct fix is to return Bits from getUnvaluedValues and so can do random access in compare/compareBottom, FieldComparator code can be copied 1:1 from trunk, only small changes needed. This would break code already using the new APIs in 3.x, but this is too broken, so we must change/fix this.

          Show
          Uwe Schindler added a comment - - edited The problem is that it's not easy to fix, because the unvalued values BitSet is exposed as DocIdSet not Bits like in trunk. The only quick fix is to clone the cached values before modifying in setNextReader and leave code unchanged. Correct fix is to return Bits from getUnvaluedValues and so can do random access in compare/compareBottom, FieldComparator code can be copied 1:1 from trunk, only small changes needed. This would break code already using the new APIs in 3.x, but this is too broken, so we must change/fix this.
          Hide
          Uwe Schindler added a comment -

          There is a bug in this impl:
          If you have two different SortFields with different missing values and you do sorts on both in parallel, the setNextReader in the comparator modifies the shared field cache, leading to concurrency issues.

          I dont understand why this is done in setNextReader? Why not do the Bits check in the comparator itsself instead of merging the missing values directly into the field cache, which is a violation to multithreading, caching,...

          Also doing the merging of missing values into the cache on every setNextReader seems like lots of additional work on every sort.

          We should do the check for the missing value directly in the compare/compareBottom method (like we do on trunk).

          Show
          Uwe Schindler added a comment - There is a bug in this impl: If you have two different SortFields with different missing values and you do sorts on both in parallel, the setNextReader in the comparator modifies the shared field cache, leading to concurrency issues. I dont understand why this is done in setNextReader? Why not do the Bits check in the comparator itsself instead of merging the missing values directly into the field cache, which is a violation to multithreading, caching,... Also doing the merging of missing values into the cache on every setNextReader seems like lots of additional work on every sort. We should do the check for the missing value directly in the compare/compareBottom method (like we do on trunk).
          Hide
          Robert Muir added a comment -

          +1 to revisit how this was done in trunk.

          Show
          Robert Muir added a comment - +1 to revisit how this was done in trunk.
          Hide
          Michael McCandless added a comment -

          Also, can we use FastBitSet, not OpenBitSet, here?

          Show
          Michael McCandless added a comment - Also, can we use FastBitSet, not OpenBitSet, here?
          Hide
          Michael McCandless added a comment -

          I like how we solved this in 3.x! Ie, a whole separate entry for holding a bitset indicating if the doc has a value.

          This is generally useful, alone, ie one can just pull this bitset and use it directly.

          It's also nice because it's one source that computes this, vs N copies (one per value) that we have on trunk.

          I guess the downside is it takes 2 passes over the terms (one to get the values, another to fill this bitset), but maybe that tradeoff is worth not duplicating the code all over... maybe we should take a similar approach in trunk?

          Show
          Michael McCandless added a comment - I like how we solved this in 3.x! Ie, a whole separate entry for holding a bitset indicating if the doc has a value. This is generally useful, alone, ie one can just pull this bitset and use it directly. It's also nice because it's one source that computes this, vs N copies (one per value) that we have on trunk. I guess the downside is it takes 2 passes over the terms (one to get the values, another to fill this bitset), but maybe that tradeoff is worth not duplicating the code all over... maybe we should take a similar approach in trunk?
          Hide
          Doron Cohen added a comment -

          Fixed in 3.x r1164794.
          Thanks Gilad!

          Show
          Doron Cohen added a comment - Fixed in 3.x r1164794. Thanks Gilad!
          Hide
          Doron Cohen added a comment -

          Attached patch fixing this bug.
          TestSort was enhanced to test the new setMissingValue() method - actually merging the test from trunk r1002460 (LUCENE-2671).

          All search test passed (running the rest now..)

          Show
          Doron Cohen added a comment - Attached patch fixing this bug. TestSort was enhanced to test the new setMissingValue() method - actually merging the test from trunk r1002460 ( LUCENE-2671 ). All search test passed (running the rest now..)
          Hide
          Doron Cohen added a comment -

          I think it may be useful to solve this also in 3x - without the cached-array-creators of the trunk, but with similar concept - i.e. an additional cache "type" will cache the docs missing values for certain field, and will allow to use the default value assigned by apps calling setMissingValue() as in trunk. Gilad and I looked at this, will post a patch shortly for review...

          Show
          Doron Cohen added a comment - I think it may be useful to solve this also in 3x - without the cached-array-creators of the trunk, but with similar concept - i.e. an additional cache "type" will cache the docs missing values for certain field, and will allow to use the default value assigned by apps calling setMissingValue() as in trunk. Gilad and I looked at this, will post a patch shortly for review...
          Hide
          Michael McCandless added a comment -

          In Lucene trunk I believe this (how/where a doc that's missing the fields sorts) is now controllable: you can call SortField.setMissingValue.

          Show
          Michael McCandless added a comment - In Lucene trunk I believe this (how/where a doc that's missing the fields sorts) is now controllable: you can call SortField.setMissingValue.
          Hide
          Hoss Man added a comment -

          a few semi-related issues based on a quick search of "sortMissingLast" ... there are lots of others that are likely just as on point (or possible more current)

          Show
          Hoss Man added a comment - a few semi-related issues based on a quick search of "sortMissingLast" ... there are lots of others that are likely just as on point (or possible more current)
          Hide
          Hoss Man added a comment -

          This is a fairly well known behavior of the FieldCache in Lucene (and all sorting uses FieldCache) ... i'm guessing it's probably not documented very well however.

          There are a bunch of open issues that relate to trying to rectify this problem - but the one known work arround at this point is to not use NumericField, and use StrIndex based FieldCache (which does understand docs w/o values). This is the approach used in Solr for the "Sortable*Field" numerics.

          Show
          Hoss Man added a comment - This is a fairly well known behavior of the FieldCache in Lucene (and all sorting uses FieldCache) ... i'm guessing it's probably not documented very well however. There are a bunch of open issues that relate to trying to rectify this problem - but the one known work arround at this point is to not use NumericField, and use StrIndex based FieldCache (which does understand docs w/o values). This is the approach used in Solr for the "Sortable*Field" numerics.
          Hide
          Gilad Barkai added a comment -

          example code

          Show
          Gilad Barkai added a comment - example code

            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Gilad Barkai
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development