  1. Lucene - Core
  2. LUCENE-2470

Add conditional branching/merging to Lucene's analysis pipeline


    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: modules/analysis


      Captured from a #lucene brainstorming session with Robert Muir:

      Lucene's analysis pipeline would be more flexible if it were possible to apply filter(s) to only part of an input stream's tokens, under user-specifiable conditions (e.g. when a given token attribute has a particular value) in a way that did not place that responsibility on individual filters.

      Two use cases:

      1. StandardAnalyzer could handle ideographic characters directly, in the same way as CJKTokenizer (which generates bigrams), if it could call ShingleFilter only when TypeAttribute=<CJK>, or when Robert's new ScriptAttribute=<Ideographic>.
      2. Stemming might make sense for some stemmer/domain combinations only when token length exceeds some threshold. For example, a user could configure an analyzer to stem only when CharTermAttribute length is greater than 4 characters.
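To make use case 2 concrete, here is a minimal sketch in plain Java. It does not use Lucene's TokenFilter API: terms are modeled as strings, and `applyWhen` and `TOY_STEMMER` are hypothetical names invented for illustration. The point is only that the length condition lives outside the stemmer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

public class ConditionalStemSketch {
    // Hypothetical stand-in for a real stemmer: strips one trailing "s".
    static final UnaryOperator<String> TOY_STEMMER =
        term -> term.endsWith("s") ? term.substring(0, term.length() - 1) : term;

    // Applies `filter` only to terms that satisfy `condition`;
    // all other terms pass through unchanged.
    static List<String> applyWhen(List<String> terms,
                                  Predicate<String> condition,
                                  UnaryOperator<String> filter) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            out.add(condition.test(term) ? filter.apply(term) : term);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("bus", "gas", "houses", "tokens");
        // Stem only when the term is longer than 4 characters.
        System.out.println(applyWhen(terms, t -> t.length() > 4, TOY_STEMMER));
        // prints [bus, gas, house, token]
    }
}
```

The stemmer itself knows nothing about the threshold, so the same condition could wrap any other filter.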

      One potential way to achieve this conditional branching facility is with a new kind of filter that can be configured with one or more following filters and the condition(s) under which those filters should be engaged. This could be called BranchingFilter.
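A rough sketch of what such a BranchingFilter might do, applied to use case 1. This is a toy model in plain Java, not Lucene code: `Token`, `route`, and `bigrams` are hypothetical, the "shingle" type label is arbitrary, and it assumes that each maximal run of matching tokens is handed to the branch as a unit so that a stateful filter like a shingler sees adjacent tokens together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

public class BranchingFilterSketch {
    // Toy token: term text plus a type label, loosely mirroring
    // CharTermAttribute and TypeAttribute.
    static final class Token {
        final String term;
        final String type;
        Token(String term, String type) { this.term = term; this.type = type; }
        @Override public String toString() { return term + "/" + type; }
    }

    // Hypothetical branch: turns a run of tokens into overlapping bigrams.
    // A run of length 1 is returned unchanged.
    static List<Token> bigrams(List<Token> run) {
        if (run.size() < 2) return run;
        List<Token> out = new ArrayList<>();
        for (int i = 0; i + 1 < run.size(); i++) {
            out.add(new Token(run.get(i).term + run.get(i + 1).term, "shingle"));
        }
        return out;
    }

    // Sends each maximal run of matching tokens through `branchChain`;
    // non-matching tokens pass straight through, preserving order.
    static List<Token> route(List<Token> in,
                             Predicate<Token> condition,
                             Function<List<Token>, List<Token>> branchChain) {
        List<Token> out = new ArrayList<>();
        List<Token> run = new ArrayList<>();
        for (Token t : in) {
            if (condition.test(t)) {
                run.add(t);
            } else {
                if (!run.isEmpty()) {
                    out.addAll(branchChain.apply(run));
                    run = new ArrayList<>();
                }
                out.add(t);
            }
        }
        if (!run.isEmpty()) out.addAll(branchChain.apply(run));
        return out;
    }

    public static void main(String[] args) {
        // Unicode escapes stand for three ideographic characters.
        List<Token> in = Arrays.asList(
            new Token("lucene", "<ALPHANUM>"),
            new Token("\u4e2d", "<CJK>"),
            new Token("\u6587", "<CJK>"),
            new Token("\u5206", "<CJK>"),
            new Token("search", "<ALPHANUM>"));
        System.out.println(route(in, t -> t.type.equals("<CJK>"),
                                 BranchingFilterSketch::bigrams));
    }
}
```

Note that `route` already rejoins the branch output with the untouched tokens, so in this toy the merge step is implicit.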

      I think a MergingFilter, the inverse of BranchingFilter, is necessary in the current pipeline architecture so that the branches rejoin into a single pipeline endpoint. A MergingFilter might be useful in its own right, e.g. to collect document data from multiple sources. Perhaps a conditional merging facility would be useful as well.
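One possible reading of the "multiple sources" case, sketched in plain Java: several token sources are presented as a single stream by draining them in the order given. The `merge` helper is hypothetical, and simple concatenation is an assumption here; a real MergingFilter would also have to decide how positions and offsets from different sources relate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class MergingSketch {
    // Presents several token sources as one stream, draining each in turn.
    static <T> Iterator<T> merge(List<Iterator<T>> sources) {
        final Iterator<Iterator<T>> outer = sources.iterator();
        return new Iterator<T>() {
            private Iterator<T> current = null;

            @Override public boolean hasNext() {
                // Advance past exhausted (or empty) sources.
                while ((current == null || !current.hasNext()) && outer.hasNext()) {
                    current = outer.next();
                }
                return current != null && current.hasNext();
            }

            @Override public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }

    public static void main(String[] args) {
        List<Iterator<String>> sources = new ArrayList<>();
        sources.add(Arrays.asList("conditional", "branching").iterator()); // e.g. a title field
        sources.add(Arrays.asList("lucene", "analysis").iterator());       // e.g. a body field

        List<String> merged = new ArrayList<>();
        merge(sources).forEachRemaining(merged::add);
        System.out.println(merged);
        // prints [conditional, branching, lucene, analysis]
    }
}
```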




            • Assignee:
              steve_rowe Steve Rowe


              • Created: