Lucene - Core / LUCENE-1285

WeightedSpanTermExtractor incorrectly treats the same terms occurring in different query types

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4
    • Fix Version/s: 2.4
    • Component/s: modules/highlighter
    • Labels: None
    • Lucene Fields: New, Patch Available

    Description

      Given a BooleanQuery with multiple clauses, if a term occurs both in a Span / Phrase query and in a TermQuery, the results of term extraction are unpredictable and depend on the order of the clauses. Consequently, the resulting highlighting is incorrect.

      Example text: t1 t2 t3 t4 t2
      Example query: t2 t3 "t1 t2"
      Current highlighting: [t1 t2] [t3] t4 t2
      Correct highlighting: [t1 t2] [t3] t4 [t2]

      The problem comes from the fact that we keep a Map<termText, WeightedSpanTerm>. If a term occurs in a Phrase or Span query, the resulting WeightedSpanTerm has positionSensitive=true, whereas terms added from a TermQuery have positionSensitive=false. Both entries share the same key, so the end result for this particular term depends on the order in which the clauses are processed.
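The order dependence can be reproduced in isolation. The following is a minimal sketch, not the highlighter's actual code: `SpanTerm` is a hypothetical stand-in for WeightedSpanTerm that carries only the flag, and the sketch assumes that a later put() for the same term text simply replaces the earlier entry.

```java
import java.util.HashMap;
import java.util.Map;

public class OrderDependenceDemo {
    /** Hypothetical, simplified stand-in for WeightedSpanTerm. */
    static final class SpanTerm {
        final String text;
        final boolean positionSensitive;

        SpanTerm(String text, boolean positionSensitive) {
            this.text = text;
            this.positionSensitive = positionSensitive;
        }
    }

    /**
     * Puts the term "t2" twice into a plain HashMap, as if it came from two
     * clauses processed in the given order, and returns the surviving flag.
     */
    static boolean flagAfter(boolean firstClauseFlag, boolean secondClauseFlag) {
        Map<String, SpanTerm> terms = new HashMap<>();
        SpanTerm first = new SpanTerm("t2", firstClauseFlag);
        SpanTerm second = new SpanTerm("t2", secondClauseFlag);
        terms.put(first.text, first);
        terms.put(second.text, second); // overwrites the first entry
        return terms.get("t2").positionSensitive;
    }

    public static void main(String[] args) {
        // TermQuery clause first, then the phrase clause.
        System.out.println(flagAfter(false, true));
        // Phrase clause first, then the TermQuery clause.
        System.out.println(flagAfter(true, false));
    }
}
```

Under that assumption, whichever clause is processed last decides the flag. With the example query, the phrase clause comes last, so t2 ends up position sensitive and the trailing t2 in the text is not highlighted.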

      My fix is to use a Map subclass whose put() always sets the result to the most lax setting: if we already have a term with positionSensitive=true and we try to put() the same term with positionSensitive=false, the stored result gets positionSensitive=false, as it will match both cases.
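The idea of a put() that keeps the most lax setting can be sketched as follows. This is an illustration under assumptions, not the contents of the attached patch: `LaxTermMap` and `SpanTerm` are hypothetical names, and `SpanTerm` again stands in for WeightedSpanTerm with only the flag modelled.

```java
import java.util.HashMap;

public class LaxTermMapDemo {
    /** Hypothetical, simplified stand-in for WeightedSpanTerm. */
    static final class SpanTerm {
        final String text;
        boolean positionSensitive;

        SpanTerm(String text, boolean positionSensitive) {
            this.text = text;
            this.positionSensitive = positionSensitive;
        }
    }

    /** HashMap whose put() keeps the most lax flag seen so far for a term. */
    static final class LaxTermMap extends HashMap<String, SpanTerm> {
        @Override
        public SpanTerm put(String key, SpanTerm value) {
            SpanTerm previous = super.put(key, value);
            // If the term was already stored as position insensitive,
            // the new entry must stay position insensitive as well.
            if (previous != null && !previous.positionSensitive) {
                value.positionSensitive = false;
            }
            return previous;
        }
    }

    /** Puts "t2" twice with the given flags and returns the stored flag. */
    static boolean flagAfter(boolean firstClauseFlag, boolean secondClauseFlag) {
        LaxTermMap terms = new LaxTermMap();
        terms.put("t2", new SpanTerm("t2", firstClauseFlag));
        terms.put("t2", new SpanTerm("t2", secondClauseFlag));
        return terms.get("t2").positionSensitive;
    }

    public static void main(String[] args) {
        System.out.println(flagAfter(false, true));
        System.out.println(flagAfter(true, false));
        System.out.println(flagAfter(true, true));
    }
}
```

In this sketch the stored flag is false as soon as any clause contributes the term without position sensitivity, regardless of clause order, and stays true only when every occurrence is position sensitive. Note that only put() is overridden here; bulk operations such as putAll() may bypass the override, depending on the HashMap implementation.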

      Attachments

        1. highlighter.patch
          3 kB
          Andrzej Bialecki
        2. highlighter-test.patch
          1 kB
          Mark Miller

            People

              Assignee: Otis Gospodnetic
              Reporter: Andrzej Bialecki