Lucene - Core / LUCENE-538

Using WildcardQuery with MultiSearcher, and Boolean MUST_NOT clause

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: 3.1
    • Component/s: core/search
    • Labels:
      None
    • Environment:

      Ubuntu Linux, java version 1.5.0_04

      Description

      We are searching across multiple indices using a MultiSearcher. There seems to be a problem when we use a WildcardQuery to exclude documents from the result set. I attach a set of unit tests illustrating the problem.

      In these tests, we have two indices. Each index contains a set of documents with fields for 'title', 'section' and 'index'. The final aim is to do a keyword search, across both indices, on the title field and be able to exclude documents from certain sections (and their subsections) using a
      WildcardQuery on the section field.

      e.g. return documents from both indices which have the string 'xyzpqr' in their title but which do not lie
      in the news section or its subsections (section = /news/*).
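The intended result set can be pinned down with a small plain-Java model. This is only a sketch of the expected semantics and does not use Lucene; the `ExpectedExclusion` class, the `Doc` shape, the sample documents and the trailing-slash section values are all assumptions made for illustration and are not taken from the attached tests.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the expected result set; this is not Lucene code. */
class ExpectedExclusion {

    /** Hypothetical document shape with the three fields named in the report. */
    static final class Doc {
        final String title;
        final String section;
        final String index;

        Doc(String title, String section, String index) {
            this.title = title;
            this.section = section;
            this.index = index;
        }
    }

    /** True when the section matches a trailing-star pattern such as "/news/*". */
    static boolean matchesPrefixPattern(String section, String pattern) {
        String prefix = pattern.substring(0, pattern.length() - 1); // drop the trailing '*'
        return section.startsWith(prefix);
    }

    /** Keeps documents whose title contains the keyword and whose section is not excluded. */
    static List<Doc> search(List<Doc> docs, String keyword, String excludedPattern) {
        List<Doc> hits = new ArrayList<>();
        for (Doc doc : docs) {
            if (doc.title.contains(keyword) && !matchesPrefixPattern(doc.section, excludedPattern)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    /** Made-up sample data spread over two made-up index names. */
    static List<Doc> sampleDocs() {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("xyzpqr one", "/news/", "index1"));
        docs.add(new Doc("xyzpqr two", "/news/sport/", "index2"));
        docs.add(new Doc("xyzpqr three", "/sport/", "index1"));
        docs.add(new Doc("xyzpqr four", "/weather/", "index2"));
        docs.add(new Doc("unrelated", "/sport/", "index2"));
        return docs;
    }

    public static void main(String[] args) {
        for (Doc hit : search(sampleDocs(), "xyzpqr", "/news/*")) {
            System.out.println(hit.title + " " + hit.section + " " + hit.index);
        }
    }
}
```

Under these assumptions only the /sport/ and /weather/ documents containing 'xyzpqr' should come back, regardless of which index holds them.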

      The first unit test (testExcludeSectionsWildCard) fails trying to do this.
      If we relax any of the constraints made above, tests pass:

      • Don't use WildcardQuery, but pass in the news section and its child section to exclude explicitly (testExcludeSectionsExplicit)
      • Exclude results from just one section, not its children too, i.e. don't use WildcardQuery (testExcludeSingleSection)
      • Do use WildcardQuery, and exclude a section and its children, but just use one index thereby using the simple
        IndexReader and IndexSearcher objects (testExcludeSectionsOneIndex).
      • Try the boolean MUST clause rather than MUST_NOT using the WildcardQuery i.e. only include results from the /news/ section
        and its children.

          Activity

          Helen Warren created issue -
          Helen Warren added a comment -

          Suite of Junit tests illustrating the problem described in this issue.

          Helen Warren made changes -
          Attachment added: TestMultiSearchWildCard.java [ 12324916 ]
          Paul Elschot added a comment -

          With this code in doSearch():

          System.err.println("Executing query: "+overallQuery);
          Query qrw = overallQuery.rewrite(reader);
          System.err.println("rewritten : "+qrw);
          Hits results = searcher.search(qrw);

          the test passes.

          During searcher.search(), the query is rewritten once more under the covers.
          I don't know why rewriting the overallQuery twice does not work; this may
          be a bug.

          Anyway, there should be no need to rewrite it explicitly.

          For convenience, I put the test in package org.apache.lucene.search,
          so I could run the test by:
          ant -Dtestcase=TestMultiSearchWildCard test

          Regards,
          Paul Elschot

          Michael Busch added a comment -

          The reason for this problem is how the MultiSearcher rewrites queries. It calls rewrite() on all Searchables and combines the rewritten queries thereafter.

          And here is the bug:
          Let's say we have the query +a -b* and two Searchables. The dictionary of the first Searchable's index has two expansions for b*, so calling rewrite on the first Searchable results in the query +a -(b1 b2). However, the dictionary of the second Searchable's index does not have any expansions, so the second rewritten query is +a -(). To combine these two queries, the MultiSearcher creates a new BooleanQuery and adds both rewritten queries as SHOULD clauses, so the combined query looks like: (+a -(b1 b2)) (+a -()). This query is used to search in both indexes. Now all documents that contain 'a' are found, because the negative clause within the second SHOULD clause is empty. That's why too many results from the first index are returned: the -b* no longer has any effect.
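The +a -b* example can be traced with a small plain-Java model. This is a sketch only, not Lucene code: `RewriteCombineModel`, `Rewritten` and the term sets are hypothetical names and data, documents are reduced to sets of terms, and b* is treated as a simple prefix expansion.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java model of the rewrite-then-combine behaviour; this is not Lucene code. */
class RewriteCombineModel {

    /** One rewritten query of the form "+must -(excluded terms)". */
    static final class Rewritten {
        final String must;
        final Set<String> excluded;

        Rewritten(String must, Set<String> excluded) {
            this.must = must;
            this.excluded = excluded;
        }

        /** A document matches if it has the required term and none of the excluded ones. */
        boolean matches(Set<String> docTerms) {
            if (!docTerms.contains(must)) {
                return false;
            }
            for (String term : excluded) {
                if (docTerms.contains(term)) {
                    return false;
                }
            }
            return true;
        }
    }

    static Set<String> terms(String... values) {
        return new TreeSet<>(Arrays.asList(values));
    }

    /** Expands "prefix*" against a single index dictionary, one Searchable at a time. */
    static Rewritten rewrite(String must, String prefix, Set<String> dictionary) {
        Set<String> expansions = new TreeSet<>();
        for (String term : dictionary) {
            if (term.startsWith(prefix)) {
                expansions.add(term);
            }
        }
        return new Rewritten(must, expansions);
    }

    /** SHOULD semantics: the combined query matches if any rewritten query matches. */
    static boolean combinedMatches(List<Rewritten> rewritten, Set<String> docTerms) {
        for (Rewritten query : rewritten) {
            if (query.matches(docTerms)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Rewritten first = rewrite("a", "b", terms("a", "b1", "b2")); // models +a -(b1 b2)
        Rewritten second = rewrite("a", "b", terms("a", "c"));       // models +a -()
        Set<String> doc = terms("a", "b1");                          // a document in the first index
        System.out.println(first.matches(doc));                      // false: excluded by b1
        System.out.println(combinedMatches(Arrays.asList(first, second), doc)); // true: the empty exclusion lets it through
    }
}
```

In this model the document containing b1 is rejected by the first rewritten query on its own but accepted by the combined query, which mirrors the excess results described above.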

          The workaround Paul suggested works because it calls rewrite on the MultiReader instead of on the MultiSearcher. The b* is then expanded using the merged dictionaries from both indexes. So this workaround simply hides the problem in MultiSearcher.
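Why expanding against the merged dictionaries avoids the problem can be sketched in the same simplified way. Again this is plain Java rather than Lucene: `MergedDictionaryModel` and its data are hypothetical, and b* is modelled as a prefix expansion over the union of the per-index term sets.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java model of expanding a prefix against merged dictionaries; not Lucene code. */
class MergedDictionaryModel {

    static Set<String> terms(String... values) {
        return new TreeSet<>(Arrays.asList(values));
    }

    /** Expands "prefix*" against the union of all index dictionaries. */
    static Set<String> expandMerged(String prefix, List<Set<String>> dictionaries) {
        Set<String> expansions = new TreeSet<>();
        for (Set<String> dictionary : dictionaries) {
            for (String term : dictionary) {
                if (term.startsWith(prefix)) {
                    expansions.add(term);
                }
            }
        }
        return expansions;
    }

    /** "+must -(excluded)" evaluated against one document's terms. */
    static boolean matches(String must, Set<String> excluded, Set<String> docTerms) {
        if (!docTerms.contains(must)) {
            return false;
        }
        for (String term : excluded) {
            if (docTerms.contains(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> excluded = expandMerged("b",
                Arrays.asList(terms("a", "b1", "b2"), terms("a", "c")));
        System.out.println(excluded);                                  // [b1, b2]
        System.out.println(matches("a", excluded, terms("a", "b1"))); // false
        System.out.println(matches("a", excluded, terms("a", "c")));  // true
    }
}
```

Because there is a single rewritten query with a single, complete exclusion set, no clause with an empty negative part exists to let excluded documents back in.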

          Michael Busch made changes -
          Priority: Major [ 3 ] -> Minor [ 4 ]
          Mark Miller made changes -
          Link: This issue is duplicated by LUCENE-1300
          Mark Miller added a comment -

          So my dream is to remove Remote from contrib and fix this issue.

          Robert Muir added a comment -

          This is now fixed by Mike's cleanup of MultiSearcher etc., which fixes this combine/rewrite bug.

          Robert Muir made changes -
          Status: Open [ 1 ] -> Resolved [ 5 ]
          Fix Version/s: 3.1 [ 12314822 ]
          Resolution: Fixed [ 1 ]
          Mark Thomas made changes -
          Workflow jira [ 12353310 ] Default workflow, editable Closed status [ 12564025 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12564025 ] jira [ 12585497 ]
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Grant Ingersoll made changes -
          Status: Resolved [ 5 ] -> Closed [ 6 ]

          Transition          Time In Source Status   Execution Times   Last Executer     Last Execution Date
          Open -> Resolved    1756d 18h 49m           1                 Robert Muir       25/Jan/11 13:21
          Resolved -> Closed  64d 2h 28m              1                 Grant Ingersoll   30/Mar/11 16:50

            People

            • Assignee: Unassigned
            • Reporter: Helen Warren
            • Votes: 0
            • Watchers: 1
