  1. Lucene - Core
  2. LUCENE-2470

Add conditional branching/merging to Lucene's analysis pipeline

Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      Captured from a #lucene brainstorming session with Robert Muir:

      Lucene's analysis pipeline would be more flexible if it were possible to apply filter(s) to only part of an input stream's tokens, under user-specifiable conditions (e.g. when a given token attribute has a particular value) in a way that did not place that responsibility on individual filters.

      Two use cases:

      1. StandardAnalyzer could handle ideographic characters directly, generating bigrams as CJKTokenizer does, if it could invoke ShingleFilter only when TypeAttribute=<CJK> or when Robert's new ScriptAttribute=<Ideographic>.
      2. Stemming might make sense for some stemmer/domain combinations only when token length exceeds some threshold. For example, a user could configure an analyzer to stem only when CharTermAttribute length is greater than 4 characters.
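
      The two conditions above can be sketched as plain predicates over a token's attributes. This is a toy model and not Lucene's API: SketchToken, its term and type fields, and the predicate names are hypothetical stand-ins for CharTermAttribute and TypeAttribute.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for a token's attributes; not a Lucene class.
final class SketchToken {
    final String term;  // models CharTermAttribute
    final String type;  // models TypeAttribute, e.g. "<CJK>" or "<ALPHANUM>"

    SketchToken(String term, String type) {
        this.term = term;
        this.type = type;
    }
}

final class UseCaseConditions {
    // Use case 1: engage the bigram step only for tokens typed <CJK>.
    static final Predicate<SketchToken> IS_CJK = t -> "<CJK>".equals(t.type);

    // Use case 2: engage the stemmer only when the term is longer than 4 characters.
    static final Predicate<SketchToken> LONGER_THAN_4 = t -> t.term.length() > 4;

    public static void main(String[] args) {
        SketchToken longWord = new SketchToken("running", "<ALPHANUM>");
        SketchToken shortWord = new SketchToken("runs", "<ALPHANUM>");
        System.out.println(IS_CJK.test(longWord));          // prints false
        System.out.println(LONGER_THAN_4.test(longWord));   // prints true
        System.out.println(LONGER_THAN_4.test(shortWord));  // prints false
    }
}
```

      Keeping the condition separate from the filter is the point of the proposal: neither ShingleFilter nor a stemmer would need to know when it applies.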

      One potential way to achieve this conditional branching facility is a new kind of filter that is configured with one or more following filters and the condition(s) under which those filters should be engaged. This could be called BranchingFilter.
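
      A minimal sketch of how such a BranchingFilter might behave, under loud assumptions: tokens are modeled as plain strings and a "filter" as a function over a run of tokens, which is not how Lucene's TokenStream works. The class name BranchingFilterSketch and the stripPluralS stand-in stemmer are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Toy model, not Lucene's TokenStream API: tokens are plain strings and a
// "filter" is any function from a run of tokens to a run of tokens.
final class BranchingFilterSketch {
    private final Predicate<String> condition;
    private final Function<List<String>, List<String>> branch;

    BranchingFilterSketch(Predicate<String> condition,
                          Function<List<String>, List<String>> branch) {
        this.condition = condition;
        this.branch = branch;
    }

    // Sends each maximal run of consecutive matching tokens through the branch;
    // non-matching tokens pass through unchanged, in their original order.
    List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (String token : tokens) {
            if (condition.test(token)) {
                run.add(token);
            } else {
                flush(run, out);
                out.add(token);
            }
        }
        flush(run, out);
        return out;
    }

    // Applies the branch to the pending run, if any, and appends the result.
    private void flush(List<String> run, List<String> out) {
        if (!run.isEmpty()) {
            out.addAll(branch.apply(new ArrayList<>(run)));
            run.clear();
        }
    }

    // Hypothetical stand-in for a stemmer: strips one trailing "s".
    static List<String> stripPluralS(List<String> run) {
        List<String> out = new ArrayList<>();
        for (String t : run) {
            out.add(t.endsWith("s") ? t.substring(0, t.length() - 1) : t);
        }
        return out;
    }

    public static void main(String[] args) {
        BranchingFilterSketch f = new BranchingFilterSketch(
                t -> t.length() > 4, BranchingFilterSketch::stripPluralS);
        System.out.println(f.apply(Arrays.asList("cats", "houses", "bus", "tokens")));
        // prints [cats, house, bus, token]
    }
}
```

      Handing the branch whole runs rather than single tokens is a design choice made here so that a multi-token filter such as a bigram generator would see adjacent matching tokens together.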

      I think a MergingFilter, the inverse of BranchingFilter, is necessary in the current pipeline architecture, to have a single pipeline endpoint. A MergingFilter might be useful in its own right, e.g. to collect document data from multiple sources. Perhaps a conditional merging facility would be useful as well.
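
      One way to picture the merging side, again as a toy model rather than Lucene's API: each branch yields tokens tagged with their position in the input, and the merge restores a single stream ordered by position. PositionedToken and MergingFilterSketch are hypothetical names, and ordering by position is an assumption; the issue does not specify a merge order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model, not a Lucene class: a term plus its position in the input.
final class PositionedToken {
    final String term;
    final int position;

    PositionedToken(String term, int position) {
        this.term = term;
        this.position = position;
    }

    @Override
    public String toString() {
        return term + "@" + position;
    }
}

final class MergingFilterSketch {
    // Merges any number of branches into one stream ordered by position.
    // List.sort is stable, so tokens sharing a position keep branch order.
    static List<PositionedToken> merge(List<List<PositionedToken>> branches) {
        List<PositionedToken> merged = new ArrayList<>();
        for (List<PositionedToken> branch : branches) {
            merged.addAll(branch);
        }
        merged.sort(Comparator.comparingInt((PositionedToken t) -> t.position));
        return merged;
    }

    public static void main(String[] args) {
        // Branch that produced bigrams, and the pass-through branch.
        List<PositionedToken> bigrams = Arrays.asList(
                new PositionedToken("ab", 1), new PositionedToken("bc", 2));
        List<PositionedToken> passThrough = Arrays.asList(
                new PositionedToken("visit", 0), new PositionedToken("today", 3));
        System.out.println(merge(Arrays.asList(bigrams, passThrough)));
        // prints [visit@0, ab@1, bc@2, today@3]
    }
}
```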

          People

            Assignee: Unassigned
            Reporter: Steven Rowe (sarowe)