Lucene - Core / LUCENE-8509

NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets


Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New
    Description

      Discovered by an elasticsearch user and described here: https://github.com/elastic/elasticsearch/issues/33710

      The ngram tokenizer produces the tokens "a b" and " bb" (note the space at the beginning of the second token). The WDGF takes the first token and splits it in two, computing offsets for each part, so we get "a"[0,1] and "b"[2,3]. The trim filter removes the leading space from the second token but leaves its offsets unchanged, so the WDGF sees "bb"[1,4]; because the leading space has already been stripped there is nothing left to split on, and the WDGF emits the token as-is. The token stream therefore has start offsets [0, 2, 1], which go backwards, and the IndexWriter rejects it.
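      The offset bookkeeping described above can be reproduced with a small stand-alone sketch. This is a hypothetical `Token` model with simplified stand-ins for the three analysis components, not Lucene's actual classes; it assumes the input "a bb" and 3-grams:

```java
import java.util.ArrayList;
import java.util.List;

public class BackwardsOffsets {
    // Hypothetical minimal token model: text plus start/end character offsets.
    record Token(String text, int start, int end) {}

    // n-gram tokenizer stand-in: offsets point into the original input.
    static List<Token> ngrams(String input, int n) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i + n <= input.length(); i++) {
            out.add(new Token(input.substring(i, i + n), i, i + n));
        }
        return out;
    }

    // Trim filter stand-in: strips surrounding whitespace but, like Lucene's
    // TrimFilter, leaves the offsets untouched.
    static List<Token> trim(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) out.add(new Token(t.text().strip(), t.start(), t.end()));
        return out;
    }

    // WDGF-style split: sub-token offsets are computed from each part's
    // position inside the (already trimmed) token text, relative to start.
    static List<Token> wordDelimiterSplit(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            List<int[]> parts = new ArrayList<>();
            int i = 0;
            while (i < t.text().length()) {
                if (Character.isLetterOrDigit(t.text().charAt(i))) {
                    int j = i + 1;
                    while (j < t.text().length() && Character.isLetterOrDigit(t.text().charAt(j))) j++;
                    parts.add(new int[] {i, j});
                    i = j;
                } else {
                    i++;
                }
            }
            if (parts.size() == 1 && parts.get(0)[0] == 0 && parts.get(0)[1] == t.text().length()) {
                // No delimiter found: pass the token through unchanged, keeping
                // the original (now stale) offsets, e.g. "bb"[1,4].
                out.add(t);
            } else {
                for (int[] p : parts) {
                    out.add(new Token(t.text().substring(p[0], p[1]),
                                      t.start() + p[0], t.start() + p[1]));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints a[0,1], b[2,3], bb[1,4]: start offsets 0, 2, 1 go backwards.
        for (Token t : wordDelimiterSplit(trim(ngrams("a bb", 3)))) {
            System.out.println(t.text() + "[" + t.start() + "," + t.end() + "]");
        }
    }
}
```

      The key point the sketch shows is that each filter's offset logic is locally reasonable; the backwards offsets only appear because TrimFilter changes the text without changing the offsets that WDGF later relies on.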

      Attachments

        Activity


          People

            Assignee: Alan Woodward (romseygeek)
            Reporter: Alan Woodward (romseygeek)
            Votes: 0
            Watchers: 6

            Dates

              Created:
              Updated:
              Resolved:
