Lucene - Core
  1. Lucene - Core
  2. LUCENE-3609

BooleanFilter changed behavior in 3.5, no longer acts as if "minimum should match" set to 1

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The change LUCENE-3446 causes a change in behavior in BooleanFilter. It used to work as if minimum should match clauses is 1 (compared to BQ lingo), but now, if no should clauses match, then the should clauses are ignored, and for example, if there is a must clause, only that one will be used and returned.

      For example, a single must clause and should clause, with the should clause not matching anything, should not match anything, but, it will match whatever the must clause matches.

      The fix is simple, after iterating over the should clauses, if the aggregated bitset is null, return null.

      1. LUCENE-3609.patch
        0.8 kB
        Uwe Schindler
      2. LUCENE-3609.patch
        1.0 kB
        Uwe Schindler
      3. LUCENE-3609.patch
        3 kB
        Uwe Schindler

        Activity

        Hide
        Uwe Schindler added a comment -

        Committed trunk revision: 1208381

        Show
        Uwe Schindler added a comment - Committed trunk revision: 1208381
        Hide
        Uwe Schindler added a comment -

        Committed 3.x revision: 1208375

        Now forward-porting

        Show
        Uwe Schindler added a comment - Committed 3.x revision: 1208375 Now forward-porting
        Hide
        Uwe Schindler added a comment -

        Path with testcase testing all special cases (all patches for 3.x)

        Show
        Uwe Schindler added a comment - Path with testcase testing all special cases (all patches for 3.x)
        Hide
        Uwe Schindler added a comment -

        Here the fix with short-circuit.

        Show
        Uwe Schindler added a comment - Here the fix with short-circuit.
        Hide
        Uwe Schindler added a comment -

        not sure why the bool filter does not return null if it match nothing

        This does not matter, processing is the same. DocIdSet.EMPTY_DOCIDSET has same effect like null in Lucene's internals (there are checks handling those special values). In my opinion we should disallow null as return value in filters completely.

        The attached patch is the easy fix that does exactly the same like before, but it's indeed less efficient as it would return an empty FixedBitSet. So a shortcut would be nice.

        It can of course still happen that a clause returns an empty BitSet, but then the code would still work correct (but without short circuit).

        Show
        Uwe Schindler added a comment - not sure why the bool filter does not return null if it match nothing This does not matter, processing is the same. DocIdSet.EMPTY_DOCIDSET has same effect like null in Lucene's internals (there are checks handling those special values). In my opinion we should disallow null as return value in filters completely. The attached patch is the easy fix that does exactly the same like before, but it's indeed less efficient as it would return an empty FixedBitSet. So a shortcut would be nice. It can of course still happen that a clause returns an empty BitSet, but then the code would still work correct (but without short circuit).
        Hide
        Uwe Schindler added a comment -

        This is the easy patch. We still need a test, but it fixes the behaviour change.

        Show
        Uwe Schindler added a comment - This is the easy patch. We still need a test, but it fixes the behaviour change.
        Hide
        Shay Banon added a comment -

        I don't think this is the best fix, since null values for empty values allows for early exit and less processing (not sure why the bool filter does not return null if it match nothing). Why not just implement the fix I suggested?

        Show
        Shay Banon added a comment - I don't think this is the best fix, since null values for empty values allows for early exit and less processing (not sure why the bool filter does not return null if it match nothing). Why not just implement the fix I suggested?
        Hide
        Uwe Schindler added a comment -

        I agree there is something wrong. The filter logic should change to return DocIdSet.EMPTY_DOCIDSET.iterator() in getDISI. The null check can then go and the behaviour is correct again.
        The problem here only occurs if a filter returns the emoty instance or null. If it returns an empty BitSet it behaves as before.

        Show
        Uwe Schindler added a comment - I agree there is something wrong. The filter logic should change to return DocIdSet.EMPTY_DOCIDSET.iterator() in getDISI. The null check can then go and the behaviour is correct again. The problem here only occurs if a filter returns the emoty instance or null. If it returns an empty BitSet it behaves as before.
        Hide
        Shay Banon added a comment - - edited

        What I am saying is that BooleanFilter used to act in a way that at least one should clause should match, and it is no longer the case. Here is the logic that was before:

        if (shouldFilters != null) {
              for (int i = 0; i < shouldFilters.size(); i++) {
                if (res == null) {
                  res = new OpenBitSetDISI(getDISI(shouldFilters, i, reader), reader.maxDoc());
                } else { 
                  DocIdSet dis = shouldFilters.get(i).getDocIdSet(reader);
                  if(dis instanceof OpenBitSet) {
                    // optimized case for OpenBitSets
                    res.or((OpenBitSet) dis);
                  } else {
                    res.inPlaceOr(getDISI(shouldFilters, i, reader));
                  }
                }
              }
            }
        

        Assuming the getDISI returns EMTY iterator for a filter that does not match (and not null, as it will fail) for a single should clause, then the result of this will be a "res" all "zeroed" out (the first check on res==null). Then, if it went ahead and executed a must clause, it would and on a "zeroed" out bitset and the result is no matches.

        Now, with the change, we have this code:

        for (final FilterClause fc : clauses) {
          if (fc.getOccur() == Occur.SHOULD) {
            final DocIdSetIterator disi = getDISI(fc.getFilter(), reader);
            if (disi == null) continue;
            if (res == null) {
              res = new FixedBitSet(reader.maxDoc());
            }
            res.or(disi);
          }
        }
        

        The result of a single should clause that does not match anything is a res still set to null, and then, when it gets to the must clause, it will or it with the result of the must clause, and return the docs that match the must clause. You can see this is different compared to the previous behavior and actually, different than the expected behavior.

        [Update]: And the fix should be to return null only if res is null and should clauses count is higher than 0 after the check for should clause count.

        Show
        Shay Banon added a comment - - edited What I am saying is that BooleanFilter used to act in a way that at least one should clause should match, and it is no longer the case. Here is the logic that was before: if (shouldFilters != null ) { for ( int i = 0; i < shouldFilters.size(); i++) { if (res == null ) { res = new OpenBitSetDISI(getDISI(shouldFilters, i, reader), reader.maxDoc()); } else { DocIdSet dis = shouldFilters.get(i).getDocIdSet(reader); if (dis instanceof OpenBitSet) { // optimized case for OpenBitSets res.or((OpenBitSet) dis); } else { res.inPlaceOr(getDISI(shouldFilters, i, reader)); } } } } Assuming the getDISI returns EMTY iterator for a filter that does not match (and not null, as it will fail) for a single should clause, then the result of this will be a "res" all "zeroed" out (the first check on res==null). Then, if it went ahead and executed a must clause, it would and on a "zeroed" out bitset and the result is no matches. Now, with the change, we have this code: for ( final FilterClause fc : clauses) { if (fc.getOccur() == Occur.SHOULD) { final DocIdSetIterator disi = getDISI(fc.getFilter(), reader); if (disi == null ) continue ; if (res == null ) { res = new FixedBitSet(reader.maxDoc()); } res.or(disi); } } The result of a single should clause that does not match anything is a res still set to null, and then, when it gets to the must clause, it will or it with the result of the must clause, and return the docs that match the must clause. You can see this is different compared to the previous behavior and actually, different than the expected behavior. [Update] : And the fix should be to return null only if res is null and should clauses count is higher than 0 after the check for should clause count.
        Hide
        Uwe Schindler added a comment - - edited

        Shay,

        there was no real change caused by LUCENE-3446 or LUCENE-3458, the logic is identical before and after. To be sure I will write a test but if you look at the patch it will not change behaviour. The minShouldMatch logic was never implemented in BooleanFilter.

        There was one small "bug" in the filter before. It handled the case that a filter clause returned null different than the case if the clause returned an empty bitset/DocIdSet.EMPTY_DOCIDSET. So the whole thing was broken before as it was not consistent. Now it behaves exactly as Robert told.

        The minShouldMatch logic was caused by different behaviour on clauses returning null instead DocIdSet.EMPTY_DOCIDSET.

        Show
        Uwe Schindler added a comment - - edited Shay, there was no real change caused by LUCENE-3446 or LUCENE-3458 , the logic is identical before and after. To be sure I will write a test but if you look at the patch it will not change behaviour. The minShouldMatch logic was never implemented in BooleanFilter. There was one small "bug" in the filter before. It handled the case that a filter clause returned null different than the case if the clause returned an empty bitset/DocIdSet.EMPTY_DOCIDSET. So the whole thing was broken before as it was not consistent. Now it behaves exactly as Robert told. The minShouldMatch logic was caused by different behaviour on clauses returning null instead DocIdSet.EMPTY_DOCIDSET.
        Hide
        Robert Muir added a comment -

        For example, a single must clause and should clause, with the should clause not matching anything, should not match anything, but, it will match whatever the must clause matches.

        Wait, this sounds correct.

        If you have a MUST clause and a SHOULD clause, then the SHOULD clause is totally irrelevant (from boolean logic).

        Show
        Robert Muir added a comment - For example, a single must clause and should clause, with the should clause not matching anything, should not match anything, but, it will match whatever the must clause matches. Wait, this sounds correct. If you have a MUST clause and a SHOULD clause, then the SHOULD clause is totally irrelevant (from boolean logic).
        Hide
        Uwe Schindler added a comment -

        I will look into this.

        Show
        Uwe Schindler added a comment - I will look into this.

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Shay Banon
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development