Lucene - Core > LUCENE-5042

Improve NGramTokenizer

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 4.4, 6.0

    Description

      Now that we fixed NGramTokenizer and NGramTokenFilter to not produce corrupt token streams, the only way to have "true" offsets for n-grams is to use the tokenizer (the filter emits the offsets of the original token).

      Yet, our NGramTokenizer has a few flaws, in particular:

      • it doesn't have the ability to pre-tokenize the input stream, for example on whitespace,
      • it doesn't play nicely with surrogate pairs.

      Since we already broke backward compatibility for it in 4.4, I'd like to also fix these issues before we release.
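      For context, here is a minimal sketch of how per-gram offsets from the tokenizer can be inspected. It is only an illustration, not part of this issue: the class name NGramOffsetsDemo is made up, and it assumes the 4.4-era constructor NGramTokenizer(Version, Reader, minGram, maxGram).

          import java.io.StringReader;

          import org.apache.lucene.analysis.ngram.NGramTokenizer;
          import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
          import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
          import org.apache.lucene.util.Version;

          public class NGramOffsetsDemo {
            public static void main(String[] args) throws Exception {
              // Illustrative sketch (assumed 4.4-era API): emit 1- and 2-grams over "abcd"
              // and print each gram together with the offsets it reports into the input.
              NGramTokenizer tokenizer =
                  new NGramTokenizer(Version.LUCENE_44, new StringReader("abcd"), 1, 2);
              CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
              OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);

              tokenizer.reset();
              while (tokenizer.incrementToken()) {
                System.out.println(term + " [" + offset.startOffset() + "," + offset.endOffset() + ")");
              }
              tokenizer.end();
              tokenizer.close();
            }
          }

      With NGramTokenFilter, by contrast, every gram carries the offsets of the whole original token, which is why the tokenizer is currently the only way to get "true" per-gram offsets.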

      Attachments

        1. LUCENE-5042.patch (16 kB) by Adrien Grand
        2. LUCENE-5042.patch (66 kB) by Adrien Grand


          People

            Assignee: Adrien Grand (jpountz)
            Reporter: Adrien Grand (jpountz)
            Votes: 0
            Watchers: 2
