Lucene - Core / LUCENE-538

Using WildcardQuery with MultiSearcher, and Boolean MUST_NOT clause

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: 3.1
    • Component/s: core/search
    • Labels:
      None
    • Environment:

      Ubuntu Linux, java version 1.5.0_04

      Description

      We are searching across multiple indices using a MultiSearcher. There seems to be a problem when we use a WildcardQuery to exclude documents from the result set. I attach a set of unit tests illustrating the problem.

      In these tests, we have two indices. Each index contains a set of documents with fields for 'title', 'section' and 'index'. The final aim is to do a keyword search, across both indices, on the title field and be able to exclude documents from certain sections (and their subsections) using a
      WildcardQuery on the section field.

      e.g. return documents from both indices which have the string 'xyzpqr' in their title but which do not lie
      in the news section or its subsections (section = /news/*).

      The first unit test (testExcludeSectionsWildCard) fails trying to do this.
      If we relax any of the constraints made above, tests pass:

      • Don't use WildcardQuery, but pass in the news section and its child section to exclude explicitly (testExcludeSectionsExplicit).
      • Exclude results from just one section, not its children too, i.e. don't use WildcardQuery (testExcludeSingleSection).
      • Do use WildcardQuery, and exclude a section and its children, but just use one index thereby using the simple
        IndexReader and IndexSearcher objects (testExcludeSectionsOneIndex).
      • Try the boolean MUST clause rather than MUST_NOT using the WildcardQuery i.e. only include results from the /news/ section
        and its children.
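
      The behaviour the failing test expects can be sketched in plain Java without Lucene. This is only a toy model of the intended semantics, not the attached test code: the Doc class, the search helper and all field values below are hypothetical, and the wildcard /news/* is approximated by a prefix check on the section field.

      ```java
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.List;

      // Toy model only (no Lucene): shows which documents the report expects back.
      final class ExcludeSectionSketch {

          /** Hypothetical document with the three fields named in the report. */
          static final class Doc {
              final String title;
              final String section;
              final String index;

              Doc(String title, String section, String index) {
                  this.title = title;
                  this.section = section;
                  this.index = index;
              }
          }

          /**
           * Returns documents from all indices whose title contains the keyword
           * and whose section does NOT start with the excluded prefix
           * (a stand-in for a MUST_NOT wildcard clause such as section:/news/*).
           */
          static List<Doc> search(List<List<Doc>> indices, String keyword, String excludedPrefix) {
              List<Doc> hits = new ArrayList<>();
              for (List<Doc> index : indices) {
                  for (Doc doc : index) {
                      if (doc.title.contains(keyword) && !doc.section.startsWith(excludedPrefix)) {
                          hits.add(doc);
                      }
                  }
              }
              return hits;
          }

          public static void main(String[] args) {
              List<Doc> index1 = Arrays.asList(
                      new Doc("xyzpqr headline", "/news/", "1"),
                      new Doc("xyzpqr match report", "/news/sport/", "1"),
                      new Doc("xyzpqr essay", "/features/", "1"));
              List<Doc> index2 = Arrays.asList(
                      new Doc("xyzpqr review", "/arts/", "2"),
                      new Doc("unrelated title", "/arts/", "2"));

              // Only the /features/ and /arts/ documents containing the keyword should remain.
              for (Doc hit : search(Arrays.asList(index1, index2), "xyzpqr", "/news/")) {
                  System.out.println(hit.index + " " + hit.section + " " + hit.title);
              }
          }
      }
      ```

      In this model the two documents under /news/ and /news/sport/ are dropped regardless of which index holds them, which is the result testExcludeSectionsWildCard is described as expecting.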

          Activity

          Helen Warren added a comment -

          Suite of Junit tests illustrating the problem described in this issue.

          Paul Elschot added a comment -

          With this code in doSearch():

          System.err.println("Executing query: "+overallQuery);
          Query qrw = overallQuery.rewrite(reader);
          System.err.println("rewritten : "+qrw);
          Hits results = searcher.search(qrw);

          the test passes.

          During searcher.search(), the query is once more rewritten, under the covers.
          I don't know why rewriting the overallQuery twice does not work; this may
          be a bug.

          Anyway, there should be no need to rewrite it explicitly.

          For convenience, I put the test in package org.apache.lucene.search,
          so I could run the test by:
          ant -Dtestcase=TestMultiSearchWildCard test

          Regards,
          Paul Elschot

          Michael Busch added a comment -

          The reason for this problem is how the MultiSearcher rewrites queries. It calls rewrite() on all Searchables and combines the rewritten queries thereafter.

          And here is the bug:
          Let's say we have the query +a -b* and two Searchables. The dictionary of the first Searchable's index has two expansions for b*, so calling rewrite on the first Searchable results in the query +a -(b1 b2). However, the dictionary of the second Searchable's index does not have any expansions, so the second rewritten query is +a -(). To combine these two queries, the MultiSearcher now creates a new BooleanQuery and adds both rewritten queries as SHOULD clauses, so the combined query looks like: (+a -(b1 b2)) (+a -()). This query is used to search in both indexes. So now all documents that contain 'a' are found, because the negative clause within the second SHOULD clause is empty. That's why too many results from the first index are returned; the -b* has no effect at all anymore.

          The workaround Paul suggested works because it calls rewrite on MultiReader instead of MultiSearcher. Then the b* is expanded using the merged dictionaries from both indexes. So this workaround simply hides the problem in MultiSearcher.
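
          The effect of the combination step can be reproduced with a small plain-Java model. This is a sketch under stated assumptions, not Lucene code: documents and dictionaries are modelled as sets of terms, the Rewritten class and the expand and combinedMatches helpers are hypothetical names, and SHOULD clauses are treated as a logical OR.

          ```java
          import java.util.Arrays;
          import java.util.Collections;
          import java.util.List;
          import java.util.Set;
          import java.util.TreeSet;

          // Toy model only (no Lucene): a document is a set of terms.
          final class CombineRewriteSketch {

              /** Rewritten form of "+a -b*": one required term plus the wildcard's expansions. */
              static final class Rewritten {
                  final String required;
                  final Set<String> excluded;

                  Rewritten(String required, Set<String> excluded) {
                      this.required = required;
                      this.excluded = excluded;
                  }

                  /** Matches if the required term is present and no excluded term is. */
                  boolean matches(Set<String> doc) {
                      return doc.contains(required) && Collections.disjoint(doc, excluded);
                  }
              }

              /** Expands "prefix*" against a single term dictionary. */
              static Set<String> expand(String prefix, Set<String> dictionary) {
                  Set<String> expansions = new TreeSet<>();
                  for (String term : dictionary) {
                      if (term.startsWith(prefix)) {
                          expansions.add(term);
                      }
                  }
                  return expansions;
              }

              /** Per-index rewrites joined as SHOULD clauses: any one matching is enough. */
              static boolean combinedMatches(List<Rewritten> perIndex, Set<String> doc) {
                  for (Rewritten rewritten : perIndex) {
                      if (rewritten.matches(doc)) {
                          return true;
                      }
                  }
                  return false;
              }

              public static void main(String[] args) {
                  Set<String> dict1 = new TreeSet<>(Arrays.asList("a", "b1", "b2")); // has expansions for b*
                  Set<String> dict2 = new TreeSet<>(Arrays.asList("a", "c"));        // no expansions for b*
                  Set<String> doc = new TreeSet<>(Arrays.asList("a", "b1"));         // should be excluded by -b*

                  // Rewrite per index, then combine: (+a -(b1 b2)) (+a -())
                  List<Rewritten> perIndex = Arrays.asList(
                          new Rewritten("a", expand("b", dict1)),
                          new Rewritten("a", expand("b", dict2)));

                  // Rewrite once against the merged dictionaries: +a -(b1 b2)
                  Set<String> merged = new TreeSet<>(dict1);
                  merged.addAll(dict2);
                  Rewritten mergedRewrite = new Rewritten("a", expand("b", merged));

                  System.out.println("combined per-index rewrites match: " + combinedMatches(perIndex, doc)); // true
                  System.out.println("merged-dictionary rewrite matches: " + mergedRewrite.matches(doc));     // false
              }
          }
          ```

          In this model the document containing a and b1 slips through the combined query because the second clause has an empty exclusion set, while a single rewrite against the merged dictionaries excludes it as intended.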

          Mark Miller added a comment -

          So my dream is to remove Remote from contrib and fix this issue

          Robert Muir added a comment -

          This is now fixed by Mike's cleanup to MultiSearcher etc, which fixes this combine/rewrite bug

          Grant Ingersoll added a comment -

          Bulk close for 3.1


            People

            • Assignee:
              Unassigned
              Reporter:
              Helen Warren
            • Votes:
              0
              Watchers:
              1
