Lucene - Core
  1. Lucene - Core
  2. LUCENE-666

TERM1 OR NOT TERM2 does not perform as expected

    Details

    • Type: Bug Bug
    • Status: Reopened
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: core/queryparser
    • Labels:
      None
    • Environment:

      Windows XP, JavaCC 4.0, JDK 1.5

      Description

      test:
      [junit] Testsuite: org.apache.lucene.search.TestAornotB
      [junit] Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 0.39 sec

      [junit] ------------- Standard Output ---------------
      [junit] Doc1 = A B C
      [junit] Doc2 = A B C D
      [junit] Doc3 = A C D
      [junit] Doc4 = B C D
      [junit] Doc5 = C D
      [junit] -------------------------------------------------
      [junit] With query "A OR NOT B" we expect to hit
      [junit] all documents EXCEPT Doc4, instead we only match on Doc3.
      [junit] While LUCENE currently explicitly does not support queries of
      [junit] the type "find docs that do not contain TERM" - this explains
      [junit] not finding Doc5, but does not justify elimnating Doc1 and Doc2
      [junit] -------------------------------------------------
      [junit] the fix shoould likely require a modification to QueryParser.jj
      [junit] around the method:
      [junit] protected void addClause(Vector clauses, int conj, int mods, Query q)
      [junit] Query:c:a -c:b hits.length=1
      [junit] Query Found:Doc[0]= A C D
      [junit] 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
      [junit] 0.6115718 = (MATCH) fieldWeight(c:a in 1), product of:
      [junit] 1.0 = tf(termFreq(c:a)=1)
      [junit] 1.2231436 = idf(docFreq=3)
      [junit] 0.5 = fieldNorm(field=c, doc=1)
      [junit] 0.0 = match on prohibited clause (c:b)
      [junit] 0.6115718 = (MATCH) fieldWeight(c:b in 1), product of:
      [junit] 1.0 = tf(termFreq(c:b)=1)
      [junit] 1.2231436 = idf(docFreq=3)
      [junit] 0.5 = fieldNorm(field=c, doc=1)

      [junit] 0.6115718 = (MATCH) sum of:
      [junit] 0.6115718 = (MATCH) fieldWeight(c:a in 2), product of:
      [junit] 1.0 = tf(termFreq(c:a)=1)
      [junit] 1.2231436 = idf(docFreq=3)
      [junit] 0.5 = fieldNorm(field=c, doc=2)

      [junit] 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
      [junit] 0.0 = match on prohibited clause (c:b)
      [junit] 0.6115718 = (MATCH) fieldWeight(c:b in 3), product of:
      [junit] 1.0 = tf(termFreq(c:b)=1)
      [junit] 1.2231436 = idf(docFreq=3)
      [junit] 0.5 = fieldNorm(field=c, doc=3)

      [junit] Query:c:a (-c:b) hits.length=3
      [junit] Query Found:Doc[0]= A B C
      [junit] Query Found:Doc[1]= A B C D
      [junit] Query Found:Doc[2]= A C D
      [junit] 0.3057859 = (MATCH) product of:
      [junit] 0.6115718 = (MATCH) sum of:
      [junit] 0.6115718 = (MATCH) fieldWeight(c:a in 1), product of:
      [junit] 1.0 = tf(termFreq(c:a)=1)
      [junit] 1.2231436 = idf(docFreq=3)
      [junit] 0.5 = fieldNorm(field=c, doc=1)
      [junit] 0.5 = coord(1/2)

      [junit] 0.3057859 = (MATCH) product of:
      [junit] 0.6115718 = (MATCH) sum of:
      [junit] 0.6115718 = (MATCH) fieldWeight(c:a in 2), product of:
      [junit] 1.0 = tf(termFreq(c:a)=1)
      [junit] 1.2231436 = idf(docFreq=3)
      [junit] 0.5 = fieldNorm(field=c, doc=2)
      [junit] 0.5 = coord(1/2)

      [junit] 0.0 = (NON-MATCH) product of:
      [junit] 0.0 = (NON-MATCH) sum of:
      [junit] 0.0 = coord(0/2)

      [junit] ------------- ---------------- ---------------
      [junit] Testcase: testFAIL(org.apache.lucene.search.TestAornotB): FAILED
      [junit] resultDocs =A C D expected:<3> but was:<1>
      [junit] junit.framework.AssertionFailedError: resultDocs =A C D expected:<3> but was:<1>
      [junit] at org.apache.lucene.search.TestAornotB.testFAIL(TestAornotB.java:137)

      [junit] Test org.apache.lucene.search.TestAornotB FAILED

        Issue Links

          Activity

          Hide
          Uwe Schindler added a comment -

          Sorry, misunderstood the issue!

          Show
          Uwe Schindler added a comment - Sorry, misunderstood the issue!
          Hide
          Siddharth Gargate added a comment -

          Can we rewrite the query (A OR NOT B) to NOT(NOT(A) AND B) to solve this issue?

          Show
          Siddharth Gargate added a comment - Can we rewrite the query (A OR NOT B) to NOT(NOT(A) AND B) to solve this issue?
          Hide
          Dejan Nenov added a comment -

          For Dejan Nenov - please use this new email address:

          d [at] panaton [dot] com

          After 10 years dejannenov@jollyobject.com is being overwhelmed by spam.


          Dejan Nenov
          mobile: +1-415-999-4450
          dejannenov@jollyobject.com
          dejannenov@rocketmail.com
          Skype: dejannenov

          Show
          Dejan Nenov added a comment - For Dejan Nenov - please use this new email address: d [at] panaton [dot] com After 10 years dejannenov@jollyobject.com is being overwhelmed by spam. – Dejan Nenov mobile: +1-415-999-4450 dejannenov@jollyobject.com dejannenov@rocketmail.com Skype: dejannenov
          Hide
          Dejan Nenov added a comment -

          Any chance this may be addressed in 2.0?

          Show
          Dejan Nenov added a comment - Any chance this may be addressed in 2.0?
          Hide
          Steven Parkes added a comment -

          I think the idea of a parse exception is good. The issue is relatively subtle. A user can relatively easily generate a query that ignores terms. Doing that silently is just asking for people to spend a lot of time trying to understand why they aren't getting something they expect.

          The question is whether it's feasible to detect that an invalid query has been entered. The problem is of course having to ever enumerate all documents. But with multilevel queries (sums of products of sums, etc.), it's not immediately obvious from the parse tree whether that is being attempted or not.

          a(b + !c) is okay but (a+!d)(b + !c) isn't. Basically you can't do a term-by-term analysis of a multilevel query and decide. It's a global property of the query.

          It one reduces the multilevel query to a two level sum-of-products form (i.e., distribute all AND operations over all OR operaitons), you can inspect the result. If any term does not contain a positive term, the query is invalid. But going from multilevel to two level can generate an exponential number of terms. Maybe not an issue in IR, but it makes me nervous.

          I would think that somewhere between query parsing and search execution, the invalidity of a query would have to pop out but I haven't delved into this path enough to know where that might happen.

          If no one's looking at this (generating an exception somewhere) and no one knows of any reason it's not possible, I'll look ...

          On a related topic, and maybe a naive question, has anyone considered keeping a posting that contains all documents? Is it really that hard to do in Lucene today? That would remove this restriction. Perhaps the issue then is that the combination of the number of valid use cases combined with the number of times it would cause people trouble because they were using it invalidly means it's not a good idea? I can come up with some potentially interesting queries, especially when you combine with other features like facets. "Tell me the facet breakdown of all documents that don't contain the word microsoft ..." In many places, !microsoft is not such a terribly big set and can be computed with about the same effort that microsoft can. (Is that true?)

          Show
          Steven Parkes added a comment - I think the idea of a parse exception is good. The issue is relatively subtle. A user can relatively easily generate a query that ignores terms. Doing that silently is just asking for people to spend a lot of time trying to understand why they aren't getting something they expect. The question is whether it's feasible to detect that an invalid query has been entered. The problem is of course having to ever enumerate all documents. But with multilevel queries (sums of products of sums, etc.), it's not immediately obvious from the parse tree whether that is being attempted or not. a(b + !c) is okay but (a+!d)(b + !c) isn't. Basically you can't do a term-by-term analysis of a multilevel query and decide. It's a global property of the query. It one reduces the multilevel query to a two level sum-of-products form (i.e., distribute all AND operations over all OR operaitons), you can inspect the result. If any term does not contain a positive term, the query is invalid. But going from multilevel to two level can generate an exponential number of terms. Maybe not an issue in IR, but it makes me nervous. I would think that somewhere between query parsing and search execution, the invalidity of a query would have to pop out but I haven't delved into this path enough to know where that might happen. If no one's looking at this (generating an exception somewhere) and no one knows of any reason it's not possible, I'll look ... On a related topic, and maybe a naive question, has anyone considered keeping a posting that contains all documents? Is it really that hard to do in Lucene today? That would remove this restriction. Perhaps the issue then is that the combination of the number of valid use cases combined with the number of times it would cause people trouble because they were using it invalidly means it's not a good idea? I can come up with some potentially interesting queries, especially when you combine with other features like facets. "Tell me the facet breakdown of all documents that don't contain the word microsoft ..." In many places, !microsoft is not such a terribly big set and can be computed with about the same effort that microsoft can. (Is that true?)
          Hide
          Hoss Man added a comment -

          A ParseException when
          "A OR NOT B" is found might make sense, but what about "A AND NOT B" ... that (to me anyway) means "+A -B" which is valid.

          Show
          Hoss Man added a comment - A ParseException when "A OR NOT B" is found might make sense, but what about "A AND NOT B" ... that (to me anyway) means "+A -B" which is valid.
          Hide
          Dejan Nenov added a comment -

          I propose that we throw a ParseException - it seems very straightforward to do - any comments / concerns / objections?

          Show
          Dejan Nenov added a comment - I propose that we throw a ParseException - it seems very straightforward to do - any comments / concerns / objections?
          Hide
          Hoss Man added a comment -

          LUCENE-25 specifically relates to stop words, but the large problem is QueryParser generating BooleanQueries containing "invalid" clauses

          Show
          Hoss Man added a comment - LUCENE-25 specifically relates to stop words, but the large problem is QueryParser generating BooleanQueries containing "invalid" clauses
          Hide
          Grant Ingersoll added a comment -

          I am not sure on this, so others should definitely contribute, but here's my take:

          The QueryParser (QP) is not really designed to handle this case and there is some misunderstanding as to what theuse of multiple boolean operators in a single clause.

          I don't think the QP is designed to handle two boolean operators between 2 terms. Thus, the above query really doesn't make sense, but the QP doesn't really prevent it either. Logically, A OR NOT B is equivalent to just searching for A. However, the QP, from what I can tell, only looks at the last operator before the B term and builds it clause based on that, thus building the query A NOT B, which then yields the results as above.

          I am not sure if there is something to fix here other than the documentation. I suppose we could throw a Parse Exception if we detected 2 or more boolean operators between terms, but I am not that familiar with JavaCC, so I am not 100% sure.

          Show
          Grant Ingersoll added a comment - I am not sure on this, so others should definitely contribute, but here's my take: The QueryParser (QP) is not really designed to handle this case and there is some misunderstanding as to what theuse of multiple boolean operators in a single clause. I don't think the QP is designed to handle two boolean operators between 2 terms. Thus, the above query really doesn't make sense, but the QP doesn't really prevent it either. Logically, A OR NOT B is equivalent to just searching for A. However, the QP, from what I can tell, only looks at the last operator before the B term and builds it clause based on that, thus building the query A NOT B, which then yields the results as above. I am not sure if there is something to fix here other than the documentation. I suppose we could throw a Parse Exception if we detected 2 or more boolean operators between terms, but I am not that familiar with JavaCC, so I am not 100% sure.

            People

            • Assignee:
              Unassigned
              Reporter:
              Dejan Nenov
            • Votes:
              4 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:

                Development