Lucene - Core > LUCENE-400

NGramFilter -- construct n-grams from a TokenStream

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: modules/analysis
    • Labels: None
    • Environment: Operating System: All; Platform: All
    • Flags: Patch Available
    • Bugzilla Id: 35456

    Description

      This filter constructs n-grams (combinations of consecutive tokens up to a
      fixed size, sometimes called "shingles") from a token stream.

      The filter sets start offsets, end offsets and position increments, so
      highlighting and phrase queries should work.

      Position increments > 1 in the input stream are replaced by filler tokens
      (tokens with termText "_" and endOffset - startOffset = 0) in the output
      n-grams. (Position increments > 1 in the input stream usually result from
      tokens, e.g. stopwords, having been removed from the stream.)
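      To illustrate the behaviour described above, here is a minimal,
      self-contained sketch of shingle construction with "_" filler tokens. It is
      not the attached NGramFilter (which operates on a Lucene TokenStream and
      uses Commons-Collections buffers); the Token record and method names below
      are illustrative assumptions only.

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleSketch {

    // Minimal token model; field names are illustrative, not Lucene's API.
    record Token(String term, int start, int end, int posIncr) {}

    /**
     * Builds space-joined shingles of size 2..maxSize. A position increment
     * greater than 1 (e.g. left by a removed stopword) is expanded into "_"
     * filler tokens, as the description above specifies.
     */
    static List<String> shingles(List<Token> input, int maxSize) {
        // 1. Expand position gaps into filler tokens.
        List<String> terms = new ArrayList<>();
        for (Token t : input) {
            for (int i = 1; i < t.posIncr(); i++) terms.add("_");
            terms.add(t.term());
        }
        // 2. Slide windows of each size over the expanded stream.
        List<String> out = new ArrayList<>();
        for (int size = 2; size <= maxSize; size++) {
            for (int i = 0; i + size <= terms.size(); i++) {
                out.add(String.join(" ", terms.subList(i, i + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "please divide this sentence" with "this" removed as a stopword:
        // "sentence" therefore carries a position increment of 2.
        List<Token> tokens = List.of(
            new Token("please", 0, 6, 1),
            new Token("divide", 7, 13, 1),
            new Token("sentence", 19, 27, 2));
        System.out.println(shingles(tokens, 2));
        // → [please divide, divide _, _ sentence]
    }
}
```

      Note how the filler token keeps "divide" and "sentence" from being joined
      into a bigram that never occurred adjacently in the original text, which is
      what preserves phrase-query semantics.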

      The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache
      Commons-Collections.

      The filter, test cases, and an analyzer wrapper are attached.

      Attachments

        1. ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapper.java
          2 kB
          Sebastian Kirsch
        2. ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapperTest.java
          5 kB
          Sebastian Kirsch
        3. ASF.LICENSE.NOT.GRANTED--NGramFilter.java
          6 kB
          Sebastian Kirsch
        4. ASF.LICENSE.NOT.GRANTED--NGramFilterTest.java
          6 kB
          Sebastian Kirsch
        5. LUCENE-400.patch
          26 kB
          Steven Rowe


          People

            Assignee: Grant Ingersoll (gsingers)
            Reporter: Sebastian Kirsch (apache-bugzilla@sebastian-kirsch.org)
            Votes: 5
            Watchers: 2

