  1. Lucene - Core
  2. LUCENE-2400

ShingleFilter: don't output all-filler shingles/unigrams; also, convert from TermAttribute to CharTermAttribute


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.1
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

    Description

      When the input token stream to ShingleFilter has position increments greater than one, a filler token is inserted at each position that has no token in the input stream. As a result, unigrams (if configured) and shingles can consist entirely of filler tokens. Such filler-only output tokens carry no information and should not be emitted.
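
      To illustrate the behavior described above, here is a minimal, self-contained sketch. It does not use Lucene and is not the patch's code; the class `ShingleSketch`, its methods, and the `dropAllFiller` flag are hypothetical names introduced for illustration. It assumes "_" as the filler text and a single space as the token separator, and it models position gaps only between tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of shingling with filler tokens.
// Not Lucene's ShingleFilter; names and structure are assumptions.
public class ShingleSketch {
    // Assumed filler text. A real input term equal to "_" would be
    // mistaken for filler in this simplified model.
    static final String FILLER = "_";

    // Expand terms into one slot per position: a position increment of n
    // puts n - 1 filler slots in front of the term.
    static List<String> expand(String[] terms, int[] posIncs) {
        List<String> slots = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            for (int gap = 1; gap < posIncs[i]; gap++) {
                slots.add(FILLER);
            }
            slots.add(terms[i]);
        }
        return slots;
    }

    // Build all shingles up to maxSize starting at each position.
    // With dropAllFiller set, windows made only of filler are skipped.
    static List<String> shingles(String[] terms, int[] posIncs, int maxSize,
                                 boolean outputUnigrams, boolean dropAllFiller) {
        List<String> slots = expand(terms, posIncs);
        List<String> out = new ArrayList<>();
        int minSize = outputUnigrams ? 1 : 2;
        for (int start = 0; start < slots.size(); start++) {
            for (int size = minSize;
                 size <= maxSize && start + size <= slots.size(); size++) {
                List<String> window = slots.subList(start, start + size);
                boolean allFiller = window.stream().allMatch(FILLER::equals);
                if (dropAllFiller && allFiller) {
                    continue; // filler-only unigram or shingle
                }
                out.add(String.join(" ", window));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = {"wizard", "oz"};
        int[] posIncs = {1, 3}; // two empty positions before "oz"
        System.out.println(shingles(terms, posIncs, 2, true, false));
        System.out.println(shingles(terms, posIncs, 2, true, true));
    }
}
```

      In this model, the first call still emits "_" and "_ _", while the second keeps only outputs that contain at least one real term, such as "wizard _" and "_ oz".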

      Because TermAttribute has been deprecated in favor of CharTermAttribute, the patch also converts the TermAttribute usages in ShingleFilter to CharTermAttribute.

      Attachments

        1. ASF.LICENSE.NOT.GRANTED--LUCENE-2400.patch
          21 kB
          Steven Rowe
        2. LUCENE-2400.patch
          21 kB
          Steven Rowe
        3. LUCENE-2400.patch
          21 kB
          Steven Rowe
        4. LUCENE-2400.patch
          25 kB
          Steven Rowe


          People

            uschindler Uwe Schindler
            sarowe Steven Rowe
            Votes: 0
            Watchers: 0

            Dates

              Created:
              Updated:
              Resolved:
