Lucene - Core / LUCENE-2035

TokenSources.getTokenStream() does not assign positionIncrement

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4, 2.4.1, 2.9
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      TokenSources.StoredTokenStream does not assign positionIncrement information, so every token in the stream is treated as adjacent to the previous one. This breaks phrase highlighting in QueryScorer when the original tokens are not contiguous.

      For example:
      Consider a token stream that emits both the stemmed and unstemmed form of each word at the same position: the fox (jump|jumped)
      When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream becomes: the fox jump jumped

      Now search and highlight for the phrase query "fox jumped". The search correctly finds the document, but the highlighter fails to highlight the phrase because it sees an additional word ("jump") between "fox" and "jumped". With the original token stream (from the analyzer), the highlighter works.
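The first example can be sketched in plain Java. This is only an illustration under stated assumptions: `PhraseDemo`, `Tok`, `positions`, and `matchesPhrase` are hypothetical names, not Lucene classes, and the matcher is a simplified exact-phrase check rather than the highlighter's real logic. It shows that a position increment of 0 on "jumped" keeps the phrase adjacent, while an increment of 1 on every token does not.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a plain-Java model of token positions, not Lucene code. */
final class PhraseDemo {

    /** Hypothetical token: a term plus its increment relative to the previous token. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Absolute positions: a running sum of increments, starting from -1. */
    static int[] positions(List<Tok> stream) {
        int[] out = new int[stream.size()];
        int pos = -1;
        for (int i = 0; i < out.length; i++) {
            pos += stream.get(i).posInc;
            out[i] = pos;
        }
        return out;
    }

    /** True if some token with the given term sits at the given position. */
    static boolean termAt(List<Tok> stream, int[] pos, String term, int position) {
        for (int i = 0; i < pos.length; i++) {
            if (pos[i] == position && stream.get(i).term.equals(term)) return true;
        }
        return false;
    }

    /** True if the phrase terms occur at consecutive positions (exact phrase, no slop). */
    static boolean matchesPhrase(List<Tok> stream, String... phrase) {
        int[] pos = positions(stream);
        for (int start : pos) {
            boolean all = true;
            for (int k = 0; k < phrase.length && all; k++) {
                all = termAt(stream, pos, phrase[k], start + k);
            }
            if (all) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Analyzer output: "jumped" shares a position with "jump" (increment 0).
        List<Tok> analyzed = Arrays.asList(
            new Tok("the", 1), new Tok("fox", 1), new Tok("jump", 1), new Tok("jumped", 0));
        // Stream with no increment information: every token advances by 1.
        List<Tok> flattened = Arrays.asList(
            new Tok("the", 1), new Tok("fox", 1), new Tok("jump", 1), new Tok("jumped", 1));

        System.out.println(matchesPhrase(analyzed, "fox", "jumped"));  // true
        System.out.println(matchesPhrase(flattened, "fox", "jumped")); // false
    }
}
```

In the flattened stream "jumped" lands at position 3 instead of 2, which is why the phrase no longer lines up.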

      Also consider the converse case: the fox did not jump
      "not" is a stop word, and there is an option to increment the position to account for removed stop words, giving: (the,0) (fox,1) (did,2) (jump,4)
      When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be: (the,0) (fox,1) (did,2) (jump,3).

      So the phrase query "did jump" will incorrectly cause the "did" and "jump" terms in the text "did not jump" to be highlighted. With the original token stream (from the analyzer), the highlighter works correctly and highlights nothing for this phrase.
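The stop-word example suggests that increments can be recovered whenever absolute positions are stored. The following sketch assumes that each stored term comes with its absolute position. `IncrementRebuilder` and `StoredTerm` are hypothetical names, and this is not taken from the attached patches. It derives each increment as the difference from the previous position, starting from -1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: rebuilds position increments from absolute positions. */
final class IncrementRebuilder {

    /** Hypothetical (term, absolute position) pair, as stored positions might supply. */
    static final class StoredTerm {
        final String term;
        final int position;
        StoredTerm(String term, int position) { this.term = term; this.position = position; }
    }

    /** Increments in position order; the first one is relative to position -1. */
    static int[] increments(List<StoredTerm> terms) {
        List<StoredTerm> sorted = new ArrayList<>(terms);
        // List.sort is stable, so terms sharing a position keep their input order.
        sorted.sort(Comparator.comparingInt((StoredTerm t) -> t.position));
        int[] incs = new int[sorted.size()];
        int prev = -1;
        for (int i = 0; i < incs.length; i++) {
            incs[i] = sorted.get(i).position - prev;
            prev = sorted.get(i).position;
        }
        return incs;
    }

    public static void main(String[] args) {
        // "the fox did not jump" with "not" removed but its position preserved.
        List<StoredTerm> stopWordGap = Arrays.asList(
            new StoredTerm("the", 0), new StoredTerm("fox", 1),
            new StoredTerm("did", 2), new StoredTerm("jump", 4));
        System.out.println(Arrays.toString(increments(stopWordGap))); // [1, 1, 1, 2]

        // "the fox (jump|jumped)": two terms stored at the same position.
        List<StoredTerm> stacked = Arrays.asList(
            new StoredTerm("the", 0), new StoredTerm("fox", 1),
            new StoredTerm("jump", 2), new StoredTerm("jumped", 2));
        System.out.println(Arrays.toString(increments(stacked))); // [1, 1, 1, 0]
    }
}
```

An increment of 2 before "jump" preserves the gap left by the stop word, and an increment of 0 keeps "jumped" stacked on "jump".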

      1. LUCENE-2035.patch
        40 kB
        Mark Miller
      2. LUCENE-2035.patch
        20 kB
        Mark Miller
      3. LUCENE-2305.patch
        20 kB
        Christopher Morris

          People

          • Assignee:
            Mark Miller
            Reporter:
            Christopher Morris
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              24h
              Remaining:
              24h
              Logged:
              Not Specified
