Lucene - Core
LUCENE-3907

Improve the Edge/NGramTokenizer/Filters

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4
    • Component/s: None
    • Labels:
    • Lucene Fields: New

      Description

      Our ngram tokenizers/filters could use some love. E.g., they output ngrams in multiple passes instead of "stacked", which messes up offsets/positions and requires too much buffering (can hit OOME for long tokens). The tokenizers clip their input at 1024 chars, but the token filters don't. They also split up surrogate pairs incorrectly.
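      To illustrate the two behaviours asked for above ("stacked" output and surrogate-pair safety), here is a minimal sketch in plain Java. It is not the attached patch and does not use any Lucene API; the class name CodePointNGrams and the method ngrams are hypothetical names chosen for this example. It assumes gram lengths are counted in Unicode code points, and it emits all grams that start at the same position together, shortest first, so only a small window of the input needs to be looked at for each start position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class CodePointNGrams {

    /**
     * Returns every n-gram of text whose length, counted in Unicode code
     * points (not UTF-16 chars), lies in [minGram, maxGram]. Grams are
     * grouped by start position ("stacked"), shortest first.
     */
    static List<String> ngrams(String text, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
        }
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = start;
            // Grow the gram one code point at a time, so a surrogate pair
            // is always consumed as a unit and never cut in half.
            for (int len = 1; len <= maxGram && end < text.length(); len++) {
                end += Character.charCount(text.codePointAt(end));
                if (len >= minGram) {
                    out.add(text.substring(start, end));
                }
            }
            // Advance the start by one whole code point as well.
            start += Character.charCount(text.codePointAt(start));
        }
        return out;
    }
}
```

      For example, CodePointNGrams.ngrams("abc", 1, 2) yields the grams a, ab, b, bc, c in that order, rather than all 1-grams followed by all 2-grams. A real tokenizer would additionally have to track offsets and position increments for each gram, which this sketch leaves out.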

      1. LUCENE-3907.patch
        43 kB
        Adrien Grand

        Issue Links

          Activity

          Changes (Field: Original Value -> New Value)

          Michael McCandless created issue
          Robert Muir made changes
            Link: (none) -> This issue is related to LUCENE-3920
          Uwe Schindler made changes
            Assignee: (none) -> Uwe Schindler [ thetaphi ]
          Robert Muir made changes
            Fix Version/s: 4.0 -> 4.1
          Robert Muir made changes
            Link: (none) -> This issue is related to LUCENE-4641
          Steve Rowe made changes
            Fix Version/s: 4.1 -> 4.2
          Walter Underwood made changes
            Link: (none) -> This issue is related to LUCENE-4810
          Robert Muir made changes
            Fix Version/s: 4.2 -> 4.3
          Adrien Grand made changes
            Labels: gsoc2012 lucene-gsoc-12 gsoc2013
          Adrien Grand made changes
            Attachment: (none) -> LUCENE-3907.patch
          Adrien Grand made changes
            Assignee: Uwe Schindler [ thetaphi ] -> Adrien Grand [ jpountz ]
          Uwe Schindler made changes
            Fix Version/s: 4.3 -> 4.4
          Adrien Grand made changes
            Status: Open -> Resolved
            Resolution: (none) -> Fixed
          Steve Rowe made changes
            Status: Resolved -> Closed

            People

            • Assignee:
              Adrien Grand
              Reporter:
              Michael McCandless
            • Votes:
              2
              Watchers:
              12

              Dates

              • Created:
                Updated:
                Resolved:
