
Key: LUCENE-1285
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Otis Gospodnetic
Reporter: Andrzej Bialecki
Votes: 0
Watchers: 0

Lucene - Java

WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types

Created: 15/May/08 01:01 PM   Updated: 11/Oct/08 12:49 PM
Component/s: contrib/highlighter
Affects Version/s: 2.4
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
highlighter-test.patch (1 kB), 2008-05-15 01:51 PM, Mark Miller, licensed for inclusion in ASF works
highlighter.patch (3 kB), 2008-05-15 01:14 PM, Andrzej Bialecki, licensed for inclusion in ASF works
Issue Links:
Reference: This issue relates to SOLR-553

Lucene Fields: New, Patch Available
Resolution Date: 27/May/08 04:10 PM


 Description
Given a BooleanQuery with multiple clauses, if a term occurs both in a Span / Phrase query and in a TermQuery, the results of term extraction are unpredictable and depend on the order of clauses. Consequently, the highlighting results are incorrect.

Example text: t1 t2 t3 t4 t2
Example query: t2 t3 "t1 t2"
Current highlighting: [t1 t2] [t3] t4 t2
Correct highlighting: [t1 t2] [t3] t4 [t2]
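
The example can be reproduced with a much simplified model of the highlighting decision. This is only a sketch under stated assumptions: the class and method names are hypothetical, plain collections stand in for the Lucene classes, and a token is taken to be highlighted when its term was extracted and is either not position sensitive or lies inside the phrase match. The model brackets each token separately, so the phrase appears as [t1] [t2] rather than [t1 t2].

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model; not the contrib/highlighter code.
class HighlightExampleSketch {

    // flags maps term text to its positionSensitive setting;
    // spanPositions holds the token positions covered by the phrase match.
    static String highlight(List<String> tokens, Map<String, Boolean> flags,
                            Set<Integer> spanPositions) {
        StringBuilder out = new StringBuilder();
        for (int pos = 0; pos < tokens.size(); pos++) {
            String token = tokens.get(pos);
            Boolean positionSensitive = flags.get(token);
            // Highlight if the term was extracted and is either free of
            // position constraints or located inside the matched span.
            boolean hit = positionSensitive != null
                    && (!positionSensitive || spanPositions.contains(pos));
            if (pos > 0) {
                out.append(' ');
            }
            out.append(hit ? "[" + token + "]" : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("t1", "t2", "t3", "t4", "t2");
        Set<Integer> phraseMatch = new HashSet<>(Arrays.asList(0, 1)); // "t1 t2"

        Map<String, Boolean> flags = new HashMap<>();
        flags.put("t1", true);
        flags.put("t3", false);

        flags.put("t2", true);  // t2 ended up position sensitive
        System.out.println(highlight(text, flags, phraseMatch)); // [t1] [t2] [t3] t4 t2

        flags.put("t2", false); // t2 ended up position insensitive
        System.out.println(highlight(text, flags, phraseMatch)); // [t1] [t2] [t3] t4 [t2]
    }
}
```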

The problem comes from the fact that we keep a Map<termText, WeightedSpanTerm>, and if the same term occurs in a Phrase or Span query the resulting WeightedSpanTerm will have a positionSensitive=true, whereas terms added from TermQuery have positionSensitive=false. The end result for this particular term will depend on the order in which the clauses are processed.
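
The order dependence can be illustrated with an ordinary HashMap. This is a minimal sketch, not the extractor's code: Term is a hypothetical stand-in for WeightedSpanTerm that keeps only the flag, and flagForT2 is an invented helper that mimics adding the terms of each clause in turn.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; Term stands in for WeightedSpanTerm.
class OrderDependenceSketch {

    static final class Term {
        final String text;
        final boolean positionSensitive;

        Term(String text, boolean positionSensitive) {
            this.text = text;
            this.positionSensitive = positionSensitive;
        }
    }

    // Adds the terms in clause order to a plain map keyed by term text.
    // The last put() for a key replaces the earlier entry.
    static boolean flagForT2(Term... termsInClauseOrder) {
        Map<String, Term> terms = new HashMap<>();
        for (Term term : termsInClauseOrder) {
            terms.put(term.text, term);
        }
        return terms.get("t2").positionSensitive;
    }

    public static void main(String[] args) {
        Term fromTermQuery = new Term("t2", false);
        Term fromPhraseQuery = new Term("t2", true);

        // TermQuery clause first, phrase clause last: t2 ends up position sensitive.
        System.out.println(flagForT2(fromTermQuery, fromPhraseQuery)); // true

        // Phrase clause first, TermQuery clause last: t2 ends up insensitive.
        System.out.println(flagForT2(fromPhraseQuery, fromTermQuery)); // false
    }
}
```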

My fix is to use a Map subclass whose put() always keeps the most lax setting, i.e. if we already have a term with positionSensitive=true and we try to put() the same term with positionSensitive=false (or the other way round), the stored result has positionSensitive=false, as it will match both cases.
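
The idea behind the fix might look roughly like the sketch below. It is an assumption-laden illustration rather than the attached patch: LaxTermMap and Term are hypothetical names, Term again stands in for WeightedSpanTerm, and the sketch mutates the stored value for brevity.

```java
import java.util.HashMap;

// Hypothetical illustration of the "most lax setting wins" idea.
class LaxPutSketch {

    static final class Term {
        final String text;
        boolean positionSensitive;

        Term(String text, boolean positionSensitive) {
            this.text = text;
            this.positionSensitive = positionSensitive;
        }
    }

    // HashMap subclass (name invented here) whose put() never lets a
    // position-insensitive entry become position sensitive again.
    static final class LaxTermMap extends HashMap<String, Term> {
        private static final long serialVersionUID = 1L;

        @Override
        public Term put(String key, Term value) {
            Term previous = super.put(key, value);
            // If the replaced entry was insensitive, keep the laxer setting.
            if (previous != null && !previous.positionSensitive) {
                value.positionSensitive = false;
            }
            return previous;
        }
    }

    // Adds two terms for "t2" in the given order and returns the final flag.
    static boolean flagAfterPuts(boolean firstSensitive, boolean secondSensitive) {
        LaxTermMap terms = new LaxTermMap();
        terms.put("t2", new Term("t2", firstSensitive));
        terms.put("t2", new Term("t2", secondSensitive));
        return terms.get("t2").positionSensitive;
    }

    public static void main(String[] args) {
        System.out.println(flagAfterPuts(false, true)); // false
        System.out.println(flagAfterPuts(true, false)); // false
        System.out.println(flagAfterPuts(true, true));  // true
    }
}
```

With this behavior the final flag for a term no longer depends on the order in which the clauses are processed.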



 All activity (ascending order)
Andrzej Bialecki made changes - 15/May/08 01:13 PM
Field Original Value New Value
Summary [WeightedSpanTermExtractor doesn'] [WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types]
Description [the text shown under Description above]
Andrzej Bialecki added a comment - 15/May/08 01:14 PM
A patch to fix the issue.

Andrzej Bialecki made changes - 15/May/08 01:14 PM
Attachment highlighter.patch [ 12382109 ]
Mark Miller added a comment - 15/May/08 01:32 PM
Nice catch and the fix looks great.

Thanks Andrzej.


Mark Miller added a comment - 15/May/08 01:51 PM
Test that exposes the problem. The posted patch makes the test pass.
  • Mark

Mark Miller made changes - 15/May/08 01:51 PM
Attachment highlighter-test.patch [ 12382111 ]
Otis Gospodnetic made changes - 17/May/08 01:43 AM
Link This issue relates to SOLR-553 [ SOLR-553 ]
Otis Gospodnetic added a comment - 20/May/08 07:22 PM
Mark, are you done with this/would you like to commit this? Or should I? (Asking because of SOLR-553)

Otis Gospodnetic made changes - 21/May/08 04:12 AM
Assignee Otis Gospodnetic [ otis ]
Repository: ASF
Revision: #659965
Date: Sun May 25 11:38:55 UTC 2008
User: markrmiller
Message: LUCENE-1285: WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types
Files Changed:
MODIFY /lucene/java/trunk/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java
MODIFY /lucene/java/trunk/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java

Mark Miller added a comment - 25/May/08 11:40 AM
Just had a go at committing this. Looks good to me.

Otis Gospodnetic added a comment - 27/May/08 04:10 PM
It looks like Mark already committed this, but forgot to resolve this issue, so I'm marking it as Fixed.

Otis Gospodnetic made changes - 27/May/08 04:10 PM
Lucene Fields [New] [New, Patch Available]
Resolution Fixed [ 1 ]
Status Open [ 1 ] Resolved [ 5 ]
Michael McCandless made changes - 11/Oct/08 12:49 PM
Status Resolved [ 5 ] Closed [ 6 ]