
Key: LUCENE-1285
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Otis Gospodnetic
Reporter: Andrzej Bialecki
Votes: 0
Watchers: 0
Lucene - Java

WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types

Created: 15/May/08 01:01 PM   Updated: 11/Oct/08 12:49 PM
Component/s: contrib/highlighter
Affects Version/s: 2.4
Fix Version/s: 2.4

Time Tracking:
Not Specified

File Attachments:
highlighter-test.patch (1 kB), uploaded 2008-05-15 01:51 PM by Mark Miller, licensed for inclusion in ASF works
highlighter.patch (3 kB), uploaded 2008-05-15 01:14 PM by Andrzej Bialecki, licensed for inclusion in ASF works
Issue Links:
Reference
 

Lucene Fields: New, Patch Available
Resolution Date: 27/May/08 04:10 PM


Description
Given a BooleanQuery with multiple clauses, if a term occurs both in a Span / Phrase query and in a TermQuery, the results of term extraction are unpredictable and depend on the order of the clauses. Consequently, the highlighting results are incorrect.

Example text: t1 t2 t3 t4 t2
Example query: t2 t3 "t1 t2"
Current highlighting: [t1 t2] [t3] t4 t2
Correct highlighting: [t1 t2] [t3] t4 [t2]

The problem comes from the fact that we keep a Map<termText, WeightedSpanTerm>. If a term occurs in a Phrase or Span query, the resulting WeightedSpanTerm has positionSensitive=true, whereas terms added from a TermQuery have positionSensitive=false. Because the map holds one entry per term, the last clause processed overwrites the earlier one, so the end result for this particular term depends on the order in which the clauses are processed.
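
The order dependence can be reproduced with a plain map. This is only a minimal sketch: the class and method names are hypothetical, and a Boolean stands in for the positionSensitive flag of a WeightedSpanTerm; it is not the extractor's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration: one map entry per term text, last put() wins.
class OrderDependenceSketch {

    // Returns the positionSensitive flag that ends up stored for "t2"
    // for the example query: t2 t3 "t1 t2".
    static boolean flagForT2(boolean phraseClauseFirst) {
        Map<String, Boolean> positionSensitive = new HashMap<>();
        if (phraseClauseFirst) {
            positionSensitive.put("t2", true);   // from the phrase "t1 t2"
            positionSensitive.put("t2", false);  // from the TermQuery t2
        } else {
            positionSensitive.put("t2", false);  // from the TermQuery t2
            positionSensitive.put("t2", true);   // from the phrase "t1 t2"
        }
        return positionSensitive.get("t2");
    }

    public static void main(String[] args) {
        System.out.println(flagForT2(true));   // prints false
        System.out.println(flagForT2(false));  // prints true
    }
}
```

When the phrase clause is processed last, t2 stays position sensitive, which corresponds to the trailing t2 being left unhighlighted in the example above.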

My fix is to use a subclass of Map whose put() always keeps the most lax setting: if we already have a term with positionSensitive=true and we put() the same term with positionSensitive=false (or the other way round), the stored result has positionSensitive=false, as it will match both cases.
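
A rough sketch of such a map follows. It is not the attached patch: SpanTermSketch and LaxPositionMap are hypothetical names, and SpanTermSketch is a stripped-down stand-in for WeightedSpanTerm that carries only the flag discussed here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for WeightedSpanTerm; only the flag matters here.
class SpanTermSketch {
    final String term;
    boolean positionSensitive;

    SpanTermSketch(String term, boolean positionSensitive) {
        this.term = term;
        this.positionSensitive = positionSensitive;
    }
}

// Map that keeps the most lax positionSensitive setting for each term.
class LaxPositionMap extends HashMap<String, SpanTermSketch> {

    @Override
    public SpanTermSketch put(String key, SpanTermSketch value) {
        SpanTermSketch previous = super.put(key, value);
        // If the replaced entry was position-insensitive, the new one must be too.
        if (previous != null && !previous.positionSensitive) {
            value.positionSensitive = false;
        }
        return previous;
    }

    @Override
    public void putAll(Map<? extends String, ? extends SpanTermSketch> m) {
        // Route through put() so the same merge rule applies to bulk inserts.
        for (Map.Entry<? extends String, ? extends SpanTermSketch> e : m.entrySet()) {
            put(e.getKey(), e.getValue());
        }
    }
}

class LaxPositionDemo {
    // Returns the flag stored for "t2" after adding both clauses in the given order.
    static boolean flagForT2(boolean phraseClauseFirst) {
        LaxPositionMap terms = new LaxPositionMap();
        SpanTermSketch fromPhrase = new SpanTermSketch("t2", true);
        SpanTermSketch fromTermQuery = new SpanTermSketch("t2", false);
        if (phraseClauseFirst) {
            terms.put("t2", fromPhrase);
            terms.put("t2", fromTermQuery);
        } else {
            terms.put("t2", fromTermQuery);
            terms.put("t2", fromPhrase);
        }
        return terms.get("t2").positionSensitive;
    }

    public static void main(String[] args) {
        System.out.println(flagForT2(true));   // prints false
        System.out.println(flagForT2(false));  // prints false
    }
}
```

With this rule the stored flag for t2 is false whichever clause is processed first, so the result no longer depends on clause order.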



Andrzej Bialecki made changes - 15/May/08 01:13 PM
Field Original Value New Value
Summary WeightedSpanTermExtractor doesn' WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types
Description [set to the text shown under Description above]
Andrzej Bialecki made changes - 15/May/08 01:14 PM
Attachment highlighter.patch [ 12382109 ]
Mark Miller made changes - 15/May/08 01:51 PM
Attachment highlighter-test.patch [ 12382111 ]
Otis Gospodnetic made changes - 17/May/08 01:43 AM
Link This issue relates to SOLR-553 [ SOLR-553 ]
Otis Gospodnetic made changes - 21/May/08 04:12 AM
Assignee Otis Gospodnetic [ otis ]
Otis Gospodnetic made changes - 27/May/08 04:10 PM
Lucene Fields [New] [New, Patch Available]
Resolution Fixed [ 1 ]
Status Open [ 1 ] Resolved [ 5 ]
Michael McCandless made changes - 11/Oct/08 12:49 PM
Status Resolved [ 5 ] Closed [ 6 ]