  1. Lucene - Core
  2. LUCENE-1224

NGramTokenFilter creates bad TokenStream

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.3
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      With the current trunk, using NGramTokenFilter(min=2,max=4), I index the string "abcdef", but I can't find it by querying with "abc". Querying with "ab" does return a hit.

      The reason is that NGramTokenFilter generates a badly ordered TokenStream. A query is built from the order of the Tokens in the TokenStream: how stemming or phrases are analyzed depends on that order (Token.positionIncrement).

      With the current filter, the query string "abc" is tokenized to: ab bc abc
      meaning "query a string that has ab, bc, abc in this order".
      The expected filter would generate: ab abc(positionIncrement=0) bc
      meaning "query a string that has (ab|abc), then bc, in this order".
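
      The two orderings can be illustrated with a minimal stand-alone sketch in plain Java. It does not use Lucene's classes and is not the attached patch: NGramOrderSketch, Gram, lengthMajor, positionMajor and render are hypothetical names, and giving every gram in the current ordering a position increment of 1 is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use or reproduce Lucene's classes.
class NGramOrderSketch {

    /** A gram with its position increment (illustrative stand-in for a Lucene Token). */
    static final class Gram {
        final String text;
        final int positionIncrement;

        Gram(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Ordering reported as current: all grams of length min, then min+1, and so on.
     *  Assumption: every gram advances the position by 1. */
    static List<Gram> lengthMajor(String s, int min, int max) {
        List<Gram> out = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            for (int start = 0; start + n <= s.length(); start++) {
                out.add(new Gram(s.substring(start, start + n), 1));
            }
        }
        return out;
    }

    /** Ordering described as expected: grams grouped by start offset; only the
     *  shortest gram at each offset advances the position, longer ones stack on it. */
    static List<Gram> positionMajor(String s, int min, int max) {
        List<Gram> out = new ArrayList<>();
        for (int start = 0; start + min <= s.length(); start++) {
            for (int n = min; n <= max && start + n <= s.length(); n++) {
                int inc = (n == min) ? 1 : 0;
                out.add(new Gram(s.substring(start, start + n), inc));
            }
        }
        return out;
    }

    /** Renders grams in the notation used in this description. */
    static String render(List<Gram> grams) {
        StringBuilder sb = new StringBuilder();
        for (Gram g : grams) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(g.text);
            if (g.positionIncrement == 0) sb.append("(positionIncrement=0)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(lengthMajor("abc", 2, 4)));   // ab bc abc
        System.out.println(render(positionMajor("abc", 2, 4))); // ab abc(positionIncrement=0) bc
    }
}
```

      Both orderings emit the same set of grams; only the order and the position increments differ, which is what changes how the query is interpreted.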

      I'd like to submit a patch for this issue.

        Attachments

        1. NGramTokenFilter.patch
          1 kB
          Hiroaki Kawai
        2. NGramTokenFilter.patch
          1 kB
          Hiroaki Kawai
        3. LUCENE-1224.patch
          9 kB
          Hiroaki Kawai

              People

              • Assignee:
                Unassigned
                Reporter:
                kawai Hiroaki Kawai
              • Votes:
                2
                Watchers:
                6
