Lucene - Core

NGramFilter -- construct n-grams from a TokenStream

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

      Description

      This filter constructs n-grams (combinations of up to a fixed number of
      consecutive tokens, sometimes called "shingles") from a token stream.

      The filter sets start offsets, end offsets and position increments, so
      highlighting and phrase queries should work.
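      The combination step can be sketched as follows. This is a minimal illustration, not the attached NGramFilter; the class and method names are hypothetical, and offsets and position increments are omitted:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the combination step (hypothetical names, not the
// attached NGramFilter). For each starting token it emits every n-gram
// of sizes 1..maxSize, joining the member tokens with single spaces.
public class ShingleSketch {
    public static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder gram = new StringBuilder();
            for (int size = 1; size <= maxSize && start + size <= tokens.size(); size++) {
                if (size > 1) gram.append(' ');
                gram.append(tokens.get(start + size - 1));
                out.add(gram.toString());
            }
        }
        return out;
    }
}
```

      For the input tokens "please divide this" with a maximum size of 2, this yields: please, please divide, divide, divide this, this.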

      Position increments > 1 in the input stream are replaced by filler tokens
      (tokens with termText "_" and endOffset - startOffset = 0) in the output
      n-grams. (Position increments > 1 in the input stream are usually caused by
      removing some tokens, e.g. stopwords, from a stream.)
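      The filler step can be sketched like this. Again a hypothetical illustration rather than the attached code: it only shows how position gaps become "_" tokens, and omits the offset bookkeeping described above:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the filler step: a position increment of k > 1
// means k - 1 tokens (e.g. stopwords) were removed upstream, so k - 1
// "_" filler tokens are inserted before the current term. The real
// filter additionally sets endOffset - startOffset = 0 on fillers,
// which this sketch does not model.
public class FillerSketch {
    public static List<String> withFillers(List<String> terms, List<Integer> increments) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            for (int gap = 1; gap < increments.get(i); gap++) {
                out.add("_");
            }
            out.add(terms.get(i));
        }
        return out;
    }
}
```

      For example, if "the" was removed between "divide" and "sea", the term "sea" arrives with a position increment of 2, and the expanded stream becomes divide, _, sea, so the output bigrams are "divide _" and "_ sea".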

      The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache
      Commons Collections.

      Filter, test case and an analyzer are attached.

      1. LUCENE-400.patch
        26 kB
        Steve Rowe
      2. ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapperTest.java
        5 kB
        Sebastian Kirsch
      3. ASF.LICENSE.NOT.GRANTED--NGramFilterTest.java
        6 kB
        Sebastian Kirsch
      4. ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapper.java
        2 kB
        Sebastian Kirsch
      5. ASF.LICENSE.NOT.GRANTED--NGramFilter.java
        6 kB
        Sebastian Kirsch

        Activity

        No work has yet been logged on this issue.

          People

          • Assignee:
            Grant Ingersoll
          • Reporter:
            Sebastian Kirsch
          • Votes:
            5
          • Watchers:
            3
