Lucene - Core
LUCENE-3907

Improve the Edge/NGramTokenizer/Filters

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4
    • Component/s: None
    • Lucene Fields: New

    Description

      Our ngram tokenizers/filters could use some love. E.g., they output ngrams in multiple passes instead of "stacked", which messes up offsets/positions and requires too much buffering (it can hit OOME for long tokens). The tokenizers clip tokens at 1024 chars, but the token filters don't clip at all. They also split up surrogate pairs incorrectly.
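
      To make the offsets/positions point concrete, here is a minimal sketch (not part of the issue; the class name NGramDump is made up) that dumps every token an NGramTokenizer emits along with its offsets and position increment. It assumes the Lucene 4.4-era API, where the tokenizer takes a Version constant, a Reader, and min/max gram sizes; constructor signatures differ in older releases.

        import java.io.StringReader;

        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.ngram.NGramTokenizer;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
        import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
        import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
        import org.apache.lucene.util.Version;

        public class NGramDump {
          public static void main(String[] args) throws Exception {
            // Emit 1- and 2-grams of "abcd" and print each token with its
            // start/end offsets and position increment.
            TokenStream ts = new NGramTokenizer(Version.LUCENE_44,
                new StringReader("abcd"), 1, 2);
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
            PositionIncrementAttribute posInc =
                ts.addAttribute(PositionIncrementAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
              System.out.printf("%s start=%d end=%d posInc=%d%n",
                  term, off.startOffset(), off.endOffset(),
                  posInc.getPositionIncrement());
            }
            ts.end();
            ts.close();
          }
        }

      With the stacked output this issue introduces, grams that share a start offset come out adjacent (a, ab, b, bc, c, cd, d), rather than all 1-grams followed by all 2-grams as in the old multi-pass behavior.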

    Attachments

    1. LUCENE-3907.patch (43 kB, Adrien Grand)


    People

    • Assignee: Adrien Grand
    • Reporter: Michael McCandless
    • Votes: 2
    • Watchers: 13

