Lucene - Core

NGramFilter -- construct n-grams from a TokenStream

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4
    • Component/s: modules/analysis
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

      Description

      This filter constructs n-grams (combinations of up to a fixed number of
      consecutive tokens, sometimes called "shingles") from a token stream.

      The filter sets start offsets, end offsets and position increments, so
      highlighting and phrase queries should work.
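      The combination step can be sketched as follows. This is a minimal illustration, not the attached NGramFilter; the class and method names are hypothetical, and offsets and position increments are omitted:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the combination step (hypothetical names, not the
// attached NGramFilter). For each starting token it emits every n-gram
// of sizes 1..maxSize, joining the member tokens with single spaces.
public class ShingleSketch {
    public static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder gram = new StringBuilder();
            for (int size = 1; size <= maxSize && start + size <= tokens.size(); size++) {
                if (size > 1) gram.append(' ');
                gram.append(tokens.get(start + size - 1));
                out.add(gram.toString());
            }
        }
        return out;
    }
}
```

      For the input tokens "please divide this" with a maximum size of 2, this yields: please, please divide, divide, divide this, this.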

      Position increments > 1 in the input stream are replaced by filler tokens
      (tokens with termText "_" and endOffset - startOffset = 0) in the output
      n-grams. (Position increments > 1 in the input stream are usually caused by
      removing some tokens, e.g. stopwords, from a stream.)
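      The filler step can be sketched like this. Again a hypothetical illustration rather than the attached code: it only shows how position gaps become "_" tokens, and omits the offset bookkeeping described above:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the filler step: a position increment of k > 1
// means k - 1 tokens (e.g. stopwords) were removed upstream, so k - 1
// "_" filler tokens are inserted before the current term. The real
// filter additionally sets endOffset - startOffset = 0 on fillers,
// which this sketch does not model.
public class FillerSketch {
    public static List<String> withFillers(List<String> terms, List<Integer> increments) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            for (int gap = 1; gap < increments.get(i); gap++) {
                out.add("_");
            }
            out.add(terms.get(i));
        }
        return out;
    }
}
```

      For example, if "the" was removed between "divide" and "sea", the term "sea" arrives with a position increment of 2, and the expanded stream becomes divide, _, sea, so the output bigrams are "divide _" and "_ sea".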

      The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache
      Commons Collections.

      Filter, test case and an analyzer are attached.

      1. LUCENE-400.patch
        26 kB
        Steve Rowe
      2. ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapperTest.java
        5 kB
        Sebastian Kirsch
      3. ASF.LICENSE.NOT.GRANTED--NGramFilterTest.java
        6 kB
        Sebastian Kirsch
      4. ASF.LICENSE.NOT.GRANTED--NGramAnalyzerWrapper.java
        2 kB
        Sebastian Kirsch
      5. ASF.LICENSE.NOT.GRANTED--NGramFilter.java
        6 kB
        Sebastian Kirsch

        Activity

        No work has yet been logged on this issue.

          People

          • Assignee:
            Grant Ingersoll
          • Reporter:
            Sebastian Kirsch
          • Votes:
            5
          • Watchers:
            3
