Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

Description

Now that we fixed NGramTokenizer and NGramTokenFilter so that they no longer produce corrupt token streams, the only way to get "true" offsets for n-grams is to use the tokenizer (the filter emits the offsets of the original token).

Yet, our NGramTokenizer has a few flaws, in particular:

• it doesn't have the ability to pre-tokenize the input stream, for example on whitespace,
• it doesn't play nicely with surrogate pairs (see the sketch below).

Since we already broke backward compatibility for it in 4.4, I'd like to fix these issues as well before we release.
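
To make the surrogate-pair flaw concrete, here is a minimal plain-JDK sketch (not Lucene code; the class name SurrogateNGramDemo is made up for illustration) contrasting naive char-based bigrams, which can split a supplementary character across grams, with code-point-aware bigrams, which keep it intact. The actual fix lives in the attached patch; this snippet only demonstrates the underlying Unicode pitfall.

    // Plain-JDK illustration (not Lucene code): why n-gramming over UTF-16 chars
    // corrupts surrogate pairs. "\uD834\uDD1E" is U+1D11E (musical G clef),
    // a single code point encoded as two chars.
    public class SurrogateNGramDemo {
        public static void main(String[] args) {
            String text = "\uD834\uDD1Ea"; // three chars, but only two code points

            // Naive char-based bigrams: the second gram starts in the middle of
            // the surrogate pair and emits a dangling low surrogate plus 'a'.
            for (int i = 0; i + 2 <= text.length(); i++) {
                System.out.println("char bigram:       " + text.substring(i, i + 2));
            }

            // Code-point-aware bigrams: gram boundaries always fall on code point
            // boundaries, so the clef character is never split.
            int codePoints = text.codePointCount(0, text.length());
            for (int i = 0; i + 2 <= codePoints; i++) {
                int start = text.offsetByCodePoints(0, i);
                int end = text.offsetByCodePoints(0, i + 2);
                System.out.println("code point bigram: " + text.substring(start, end));
            }
        }
    }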

Attachments

1. LUCENE-5042.patch (66 kB) by Adrien Grand
2. LUCENE-5042.patch (16 kB) by Adrien Grand

People

• Assignee: jpountz (Adrien Grand)
• Reporter: jpountz (Adrien Grand)
• Votes: 0
• Watchers: 2

Dates

• Created:
• Updated:
• Resolved: