
Key: LUCENE-538
Type: Bug
Status: Open
Priority: Minor
Assignee: Unassigned
Reporter: Helen Warren
Votes: 0
Watchers: 1
Lucene - Java

Using WildcardQuery with MultiSearcher, and Boolean MUST_NOT clause

Created: 04/Apr/06 06:32 PM   Updated: 21/Aug/08 12:01 PM
Component/s: Search
Affects Version/s: 1.9
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments:
TestMultiSearchWildCard.java (Java source file, 12 kB), attached 2006-04-04 06:33 PM by Helen Warren
Environment: Ubuntu Linux, java version 1.5.0_04
Issue Links:
Duplicate: this issue is duplicated by LUCENE-1300

Description
We are searching across multiple indices using a MultiSearcher. There seems to be a problem when we use a WildcardQuery to exclude documents from the result set. I attach a set of unit tests illustrating the problem.

In these tests, we have two indices. Each index contains a set of documents with fields for 'title', 'section' and 'index'. The final aim is to do a keyword search, across both indices, on the title field and be able to exclude documents from certain sections (and their subsections) using a WildcardQuery on the section field.

For example: return documents from both indices which have the string 'xyzpqr' in their title but which do not lie in the news section or its subsections (section = /news/*).

The first unit test (testExcludeSectionsWildCard) fails trying to do this.
If we relax any of the constraints made above, tests pass:

  • Don't use WildcardQuery, but pass in the news section and its child section to exclude explicitly (testExcludeSectionsExplicit).
  • Exclude results from just one section, not its children too, i.e. don't use WildcardQuery (testExcludeSingleSection).
  • Do use WildcardQuery, and exclude a section and its children, but just use one index, thereby using the simple
    IndexReader and IndexSearcher objects (testExcludeSectionsOneIndex).
  • Try the boolean MUST clause rather than MUST_NOT using the WildcardQuery, i.e. only include results from the /news/ section
    and its children.


Helen Warren added a comment - 04/Apr/06 06:33 PM
Suite of Junit tests illustrating the problem described in this issue.

Helen Warren made changes - 04/Apr/06 06:33 PM
Field: Attachment
Original Value: (empty)
New Value: TestMultiSearchWildCard.java [ 12324916 ]
Paul Elschot added a comment - 05/Apr/06 04:53 AM
With this code in doSearch():

System.err.println("Executing query: "+overallQuery);
Query qrw = overallQuery.rewrite(reader);
System.err.println("rewritten : "+qrw);
Hits results = searcher.search(qrw);

the test passes.

During searcher.search(), the query is rewritten once more, under the covers.
I don't know why rewriting the overallQuery twice does not work; this may
be a bug.

Anyway, there should be no need to rewrite it explicitly.

For convenience, I put the test in package org.apache.lucene.search,
so I could run the test by:
ant -Dtestcase=TestMultiSearchWildCard test

Regards,
Paul Elschot


Michael Busch added a comment - 21/Nov/06 07:17 PM
The reason for this problem is how the MultiSearcher rewrites queries. It calls rewrite() on all Searchables and combines the rewritten queries thereafter.

And here is the bug:
Let's say we have the query +a -b* and two Searchables. The dictionary of the first Searchable's index has two expansions for b*, so calling rewrite on the first Searchable results in the query +a -(b1 b2). However, the dictionary of the second Searchable's index does not have any expansions, so the second rewritten query is +a -(). To combine these two queries, the MultiSearcher now creates a new BooleanQuery and adds both rewritten queries as SHOULD clauses, so the combined query looks like: (+a -(b1 b2)) (+a -()). This query is used to search in both indexes. So now all documents that contain 'a' are found, because the negative clause within the second SHOULD clause is empty. That's why too many results from the first index are returned: the -b* has no effect at all anymore.

The workaround Paul suggested works because it calls rewrite on the MultiReader instead of the MultiSearcher. Then the b* is expanded using the merged dictionaries from both indexes. So this workaround simply hides the problem in MultiSearcher.
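
To make the effect concrete, here is a small self-contained toy model of the two rewrite strategies for the query +a -b*. It does not use any Lucene classes: the class name MultiSearcherRewriteSketch, the document ids d1 to d5 and the helper methods are all made up for illustration, and the matching logic only mimics the behaviour described above (each rewritten query is treated as a SHOULD clause of the combined query). It is a sketch of the reasoning, not of the actual MultiSearcher code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: hypothetical names, no Lucene classes involved.
class MultiSearcherRewriteSketch {

    // Build a toy index (document id -> terms); each array is {id, term, term, ...}.
    static Map<String, Set<String>> index(String[]... docs) {
        Map<String, Set<String>> idx = new LinkedHashMap<>();
        for (String[] d : docs) {
            idx.put(d[0], new TreeSet<>(Arrays.asList(d).subList(1, d.length)));
        }
        return idx;
    }

    // Two indexes: only the first one contains terms matching b*.
    static List<Map<String, Set<String>>> sampleIndexes() {
        List<Map<String, Set<String>>> indexes = new ArrayList<>();
        indexes.add(index(new String[] {"d1", "a", "b1"},
                          new String[] {"d2", "a", "b2"},
                          new String[] {"d3", "a", "c"}));
        indexes.add(index(new String[] {"d4", "a", "c"},
                          new String[] {"d5", "c"}));
        return indexes;
    }

    // Expand a prefix wildcard against the term dictionaries of the given indexes.
    static Set<String> expand(String prefix, List<Map<String, Set<String>>> indexes) {
        Set<String> expansions = new TreeSet<>();
        for (Map<String, Set<String>> idx : indexes) {
            for (Set<String> terms : idx.values()) {
                for (String t : terms) {
                    if (t.startsWith(prefix)) expansions.add(t);
                }
            }
        }
        return expansions;
    }

    // Evaluate one rewritten query "+required -(excluded...)" against one document.
    static boolean matches(Set<String> doc, String required, Set<String> excluded) {
        if (!doc.contains(required)) return false;
        for (String e : excluded) {
            if (doc.contains(e)) return false;
        }
        return true;
    }

    // Search every index with the SHOULD-combination of the rewritten queries:
    // a document is a hit if at least one rewritten query matches it.
    static Set<String> search(List<Map<String, Set<String>>> indexes, String required,
                              List<Set<String>> rewrittenExclusions) {
        Set<String> hits = new TreeSet<>();
        for (Map<String, Set<String>> idx : indexes) {
            for (Map.Entry<String, Set<String>> doc : idx.entrySet()) {
                for (Set<String> excluded : rewrittenExclusions) {
                    if (matches(doc.getValue(), required, excluded)) hits.add(doc.getKey());
                }
            }
        }
        return hits;
    }

    // Rewrite per index, then combine: yields (+a -(b1 b2)) (+a -()).
    static Set<String> perSearchableRewrite(List<Map<String, Set<String>>> indexes) {
        List<Set<String>> rewrites = new ArrayList<>();
        for (Map<String, Set<String>> idx : indexes) {
            rewrites.add(expand("b", Collections.singletonList(idx)));
        }
        return search(indexes, "a", rewrites);
    }

    // Rewrite once against the merged dictionaries: yields +a -(b1 b2).
    static Set<String> mergedRewrite(List<Map<String, Set<String>>> indexes) {
        return search(indexes, "a", Collections.singletonList(expand("b", indexes)));
    }

    public static void main(String[] args) {
        List<Map<String, Set<String>>> indexes = sampleIndexes();
        System.out.println("per-searchable rewrite: " + perSearchableRewrite(indexes)); // [d1, d2, d3, d4]
        System.out.println("merged rewrite:         " + mergedRewrite(indexes));        // [d3, d4]
    }
}
```

In this model the per-searchable rewrite returns d1 and d2 even though they contain b1 and b2, because the second rewritten query has an empty exclusion list and so matches every document containing 'a'. Expanding b* once against the merged dictionaries keeps the exclusion intact, which corresponds to the explicit rewrite against the reader in Paul's workaround.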


Michael Busch made changes - 31/Dec/07 01:53 PM
Field: Priority
Original Value: Major [ 3 ]
New Value: Minor [ 4 ]
Mark Miller made changes - 21/Aug/08 12:01 PM
Field: Link
New Value: This issue is duplicated by LUCENE-1300 [ LUCENE-1300 ]