[LUCENE-2841] CommonGramsFilter improvements - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.1, 4.0-ALPHA
Fix Version/s: 4.9, 6.0
Component/s: modules/analysis
Labels: None
Lucene Fields: New

Description

Currently CommonGramsFilter expects users to remove the common words around which output token ngrams are formed, by appending a StopFilter to the analysis pipeline. This is inefficient in two ways: captureState() is called on (trailing) stopwords, and then the whole stream has to be re-examined by the following StopFilter.

The current ctor should be deprecated, and another ctor added with a boolean option controlling whether the common words should be output as unigrams.

If common words are configured to be output as unigrams, captureState() will still need to be called, as it is now.

If the common words are not configured to be output as unigrams, rather than calling captureState() for the trailing token in each output token ngram, the term text, position and offset can be maintained in the same way as they are now for the leading token: using a System.arraycopy()'d term buffer and a few ints for positionIncrement and offsets. The user would then no longer need to append a StopFilter to the analysis chain.
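As a rough illustration of the buffer-plus-ints idea described above, here is a minimal sketch with no Lucene dependency. The class `SavedToken` and all of its members are hypothetical names invented for this example. It only shows how a term could be kept in a reusable char[] filled with System.arraycopy() alongside a few ints, instead of capturing full attribute state.

```java
/** Hypothetical holder for one token's state; illustration only, not Lucene code. */
final class SavedToken {
    private char[] buffer = new char[16]; // reused across tokens to avoid allocation
    private int length;
    private int positionIncrement;
    private int startOffset;
    private int endOffset;

    /** Copies the term characters and records position and offsets. */
    void save(char[] termBuffer, int termLength, int posInc, int start, int end) {
        if (buffer.length < termLength) {
            // Grow only when the incoming term does not fit.
            buffer = new char[Math.max(termLength, buffer.length * 2)];
        }
        System.arraycopy(termBuffer, 0, buffer, 0, termLength);
        length = termLength;
        positionIncrement = posInc;
        startOffset = start;
        endOffset = end;
    }

    String term() { return new String(buffer, 0, length); }
    int positionIncrement() { return positionIncrement; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }
}
```

Because the characters are copied, later changes to the source term buffer do not affect the saved token, which is the property the filter would need when the upstream stream advances.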

An example illustrating both possibilities should also be added.
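To make the two possibilities concrete, the following is a simplified model of the expected token output. It is not written against the Lucene API: `CommonGramsSketch`, `commonGrams` and the `outputCommonUnigrams` parameter are hypothetical names, the "_" bigram separator is an assumption of the sketch, and position increments and offsets are not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Simplified, Lucene-free model of common-grams output; all names are hypothetical. */
final class CommonGramsSketch {

    /**
     * Emits each token, plus a bigram "a_b" whenever a or b is a common word.
     * When outputCommonUnigrams is false, common words appear only inside bigrams.
     */
    static List<String> commonGrams(List<String> tokens, Set<String> commonWords,
                                    boolean outputCommonUnigrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            boolean currentIsCommon = commonWords.contains(current);
            if (outputCommonUnigrams || !currentIsCommon) {
                out.add(current);
            }
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (currentIsCommon || commonWords.contains(next)) {
                    out.add(current + "_" + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "quick", "brown", "fox");
        Set<String> common = Set.of("the");
        System.out.println(commonGrams(tokens, common, true));
        System.out.println(commonGrams(tokens, common, false));
    }
}
```

For the input "the quick brown fox" with "the" as the only common word, the first call yields the, the_quick, quick, brown, fox, and the second yields the_quick, quick, brown, fox. The second result corresponds to what is obtained today by appending a StopFilter.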

Attachments

commit-6402a55.patch
24/Dec/12 19:22
23 kB
Itamar Syn-Hershko

People

Assignee: Unassigned

Reporter: Steven Rowe

Votes: 2

Watchers: 4

Dates

Created: 31/Dec/10 19:26

Updated: 28/Aug/22 12:38