Lucene - Core
LUCENE-2035

TokenSources.getTokenStream() does not assign positionIncrement

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4, 2.4.1, 2.9
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/highlighter
    • Labels: None
    • Lucene Fields: New

      Description

      TokenSources.StoredTokenStream does not assign positionIncrement information. This means that all tokens in the stream are considered adjacent. This has implications for the phrase highlighting in QueryScorer when using non-contiguous tokens.

      For example:
      Consider a token stream that creates tokens for both the stemmed and unstemmed version of each word - the fox (jump|jumped)
      When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - the fox jump jumped

      Now try a search and highlight for the phrase query "fox jumped". The search will correctly find the document; the highlighter will fail to highlight the phrase because it thinks that there is an additional word between "fox" and "jumped". If we use the original (from the analyzer) token stream then the highlighter works.
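To make the adjacency problem concrete, here is a minimal plain-Java sketch of the first example. It does not use the Lucene API: the `Tok` class and the `of`, `positions` and `adjacent` helpers are hypothetical names introduced only for illustration, and the convention that a first token with increment 1 lands at position 0 is an assumption of this model.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionIncrementSketch {

    /** Hypothetical token model: a term plus its increment relative to the previous token. */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Builds a stream from parallel arrays of terms and position increments. */
    static List<Tok> of(String[] terms, int[] incs) {
        List<Tok> out = new ArrayList<>(terms.length);
        for (int i = 0; i < terms.length; i++) {
            out.add(new Tok(terms[i], incs[i]));
        }
        return out;
    }

    /** Turns increments into absolute positions (assumed convention: start at -1). */
    static int[] positions(List<Tok> stream) {
        int[] out = new int[stream.size()];
        int pos = -1;
        for (int i = 0; i < stream.size(); i++) {
            pos += stream.get(i).posInc;
            out[i] = pos;
        }
        return out;
    }

    /** True if some occurrence of `second` sits exactly one position after `first`. */
    static boolean adjacent(List<Tok> stream, String first, String second) {
        int[] pos = positions(stream);
        for (int i = 0; i < stream.size(); i++) {
            if (!stream.get(i).term.equals(first)) continue;
            for (int j = 0; j < stream.size(); j++) {
                if (stream.get(j).term.equals(second) && pos[j] == pos[i] + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] terms = {"the", "fox", "jump", "jumped"};
        // Analyzer output: "jumped" has increment 0, so it shares a position with "jump".
        List<Tok> fromAnalyzer = of(terms, new int[] {1, 1, 1, 0});
        // Stream without increment information: every token is treated as increment 1.
        List<Tok> flattened = of(terms, new int[] {1, 1, 1, 1});

        System.out.println("analyzer stream:  " + adjacent(fromAnalyzer, "fox", "jumped")); // true
        System.out.println("flattened stream: " + adjacent(flattened, "fox", "jumped"));    // false
    }
}
```

In the flattened stream "jumped" ends up two positions after "fox", which is why a phrase check for "fox jumped" fails there while it succeeds on the analyzer's stream.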

      Also, consider the converse - the fox did not jump
      "not" is a stop word and there is an option to increment the position to account for stop words - (the,0) (fox,1) (did,2) (jump,4)
      When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - (the,0) (fox,1) (did,2) (jump,3).

      So the phrase query "did jump" will incorrectly cause the "did" and "jump" terms in the text "did not jump" to be highlighted. If we use the original (from the analyzer) token stream then the highlighter works correctly.
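One way to avoid this, sketched below in plain Java, is to derive each token's increment from the absolute positions stored in the index: sort by position and take the difference from the previous position. This is only an illustration of the idea and is not taken from the attached patch; the `Stored` class and `increments` method are hypothetical names, and the starting position of -1 is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.Comparator;

public class IncrementsFromPositionsSketch {

    /** Hypothetical model of a term and its absolute position as stored in the index. */
    static final class Stored {
        final String term;
        final int position;
        Stored(String term, int position) { this.term = term; this.position = position; }
    }

    /**
     * Sorts tokens by position and returns, for each one, the gap to the previous
     * position. Tokens sharing a position get increment 0; a skipped stop word
     * shows up as an increment of 2.
     */
    static int[] increments(Stored[] tokens) {
        Stored[] sorted = tokens.clone();
        // Arrays.sort on objects is stable, so tokens at the same position keep their order.
        Arrays.sort(sorted, Comparator.comparingInt(t -> t.position));
        int[] incs = new int[sorted.length];
        int prev = -1; // assumed convention: the first token at position 0 gets increment 1
        for (int i = 0; i < sorted.length; i++) {
            incs[i] = sorted[i].position - prev;
            prev = sorted[i].position;
        }
        return incs;
    }

    public static void main(String[] args) {
        // "the fox did not jump" with "not" removed but its position preserved.
        Stored[] tokens = {
            new Stored("the", 0), new Stored("fox", 1),
            new Stored("did", 2), new Stored("jump", 4)
        };
        System.out.println(Arrays.toString(increments(tokens))); // [1, 1, 1, 2]
    }
}
```

With the increment of 2 preserved on "jump", a position-aware phrase check no longer treats "did" and "jump" as adjacent.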

      1. LUCENE-2035.patch
        40 kB
        Mark Miller
      2. LUCENE-2035.patch
        20 kB
        Mark Miller
      3. LUCENE-2305.patch
        20 kB
        Christopher Morris

        Activity

        Transition             Time In Source Status   Execution Times   Last Executer     Last Execution Date
        Open -> Resolved       62d 5h 28m              1                 Mark Miller       06/Jan/10 19:08
        Resolved -> Reopened   295d 18h 2m             1                 Robert Muir       29/Oct/10 14:11
        Reopened -> Resolved   29d 10h 13m             1                 Uwe Schindler     27/Nov/10 23:25
        Resolved -> Closed     122d 16h 24m            1                 Grant Ingersoll   30/Mar/11 16:50
        Grant Ingersoll made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12564679 ] jira [ 12585795 ]
        Mark Thomas made changes -
        Workflow jira [ 12481293 ] Default workflow, editable Closed status [ 12564679 ]
        Uwe Schindler made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Fix Version/s 3.0.3 [ 12315147 ]
        Fix Version/s 2.9.4 [ 12315148 ]
        Uwe Schindler added a comment -

        Resolving again as this issue will not be backported to 2.9/3.0 branches.

        Robert Muir made changes -
        Fix Version/s 2.9.4 [ 12315148 ]
        Fix Version/s 3.0.3 [ 12315147 ]
        Fix Version/s 3.1 [ 12314822 ]
        Robert Muir made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Robert Muir added a comment -

        reopening for possible 2.9.4/3.0.3 backport.

        Mark Miller made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Mark Miller added a comment -

        Thanks Christopher!

        Christopher Morris added a comment -

        Cheers Mark,

        The custom collector was probably because I was learning the new API at the time.

        The only changes I've made since the patch I submitted were to initialise the ArrayList with tpv.getTerms().length because that represents the minimum size that the list will grow to, and to replace the List and Iterator fields with an array (derived from the list) and an integer pointer. Both of which are probably unnecessary.

        The tests could be improved - the first case could be fixed in its present form by using the Analyzer to generate the phrase query. If the stemmed word were the middle word of the phrase then that fix wouldn't work.

        Mark Miller added a comment -

        I'll commit this soon.

        Mark Miller made changes -
        Attachment LUCENE-2035.patch [ 12428241 ]
        Mark Miller added a comment -

        I've broken the new tests back out into their own file, changed the hit collector code to basically just search, and improved the test coverage of TokenSources a bit.

        Mark Miller added a comment -

        Hey Christopher, why are you going through the trouble of the custom collector to check that there are no hits? Why not just do a standard search?

        Mark Miller added a comment -

        Thanks for the tests and fix Christopher!

        I've got one more patch coming and I'll commit in a few days.

        I'm going to break the tests back out into a separate file again (on second thought I think how you had it is a good idea) and remove an author tag. Then after one more review I think this is good to go in.

        Mark Miller made changes -
        Attachment LUCENE-2035.patch [ 12428123 ]
        Mark Miller made changes -
        Fix Version/s 3.1 [ 12314025 ]
        Mark Miller made changes -
        Assignee Mark Miller [ markrmiller@gmail.com ]
        Christopher Morris made changes -
        Field Original Value New Value
        Attachment LUCENE-2305.patch [ 12424126 ]
        Christopher Morris added a comment -

        For the highlighter trunk

        Christopher Morris created issue -

          People

          • Assignee: Mark Miller
          • Reporter: Christopher Morris
          • Votes: 0
          • Watchers: 0

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated: 24h
              Remaining: 24h
              Logged: Not Specified
