Details
-
New Feature
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
4.0-ALPHA
-
None
-
None
-
New
Description
Captured from a #lucene brainstorming session with Robert Muir:
Lucene's analysis pipeline would be more flexible if it were possible to apply filter(s) to only part of an input stream's tokens, under user-specifiable conditions (e.g. when a given token attribute has a particular value) in a way that did not place that responsibility on individual filters.
Two use cases:
- StandardAnalyzer could directly handle ideographic characters in the same way as CJKTokenizer, which generates bigrams, if it could call ShingleFilter only when the TypeAttribute=<CJK>, or if Robert's new ScriptAttribute=<Ideographic>.
- Stemming might make sense for some stemmer/domain combinations only when token length exceeds some threshold. For example, a user could configure an analyzer to stem only when CharTermAttribute length is greater than 4 characters.
One potential way to achieve this conditional branching facility is with a new kind of filter that can be configured with one or more following filters and condition(s) under which the filter should be engaged. This could be called BranchingFilter.
I think a MergingFilter, the inverse of BranchingFilter, is necessary in the current pipeline architecture, to have a single pipeline endpoint. A MergingFilter might be useful in its own right, e.g. to collect document data from multiple sources. Perhaps a conditional merging facility would be useful as well.