Lucene - Core / LUCENE-2470

Add conditional branching/merging to Lucene's analysis pipeline

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      Captured from a #lucene brainstorming session with Robert Muir:

      Lucene's analysis pipeline would be more flexible if it were possible to apply filter(s) to only part of an input stream's tokens, under user-specifiable conditions (e.g. when a given token attribute has a particular value) in a way that did not place that responsibility on individual filters.

      Two use cases:

      1. StandardAnalyzer could handle ideographic characters directly, in the same way as CJKTokenizer (which generates bigrams), if it could invoke ShingleFilter only when TypeAttribute=<CJK>, or when Robert's new ScriptAttribute=<Ideographic>.
      2. Stemming might make sense for some stemmer/domain combinations only when token length exceeds some threshold. For example, a user could configure an analyzer to stem only when CharTermAttribute length is greater than 4 characters.
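The second use case can be sketched without any Lucene classes. The following is only an illustration under stated assumptions: plain strings stand in for CharTermAttribute values, `toyStem` is a hypothetical stand-in for a real stemmer, and `applyWhen` is an invented helper name, not an existing API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Library-independent model: plain strings stand in for CharTermAttribute values.
final class ConditionalStemSketch {

    // Hypothetical toy stemmer: strips a single trailing "s". Not a real algorithm.
    static String toyStem(String term) {
        return term.endsWith("s") ? term.substring(0, term.length() - 1) : term;
    }

    // Applies `filter` only to terms that satisfy `condition`;
    // all other terms pass through unchanged.
    static List<String> applyWhen(List<String> terms,
                                  Predicate<String> condition,
                                  UnaryOperator<String> filter) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            out.add(condition.test(term) ? filter.apply(term) : term);
        }
        return out;
    }

    // Use case 2: stem only terms longer than 4 characters.
    static List<String> stemLongTerms(List<String> terms) {
        return applyWhen(terms, t -> t.length() > 4, ConditionalStemSketch::toyStem);
    }
}
```

With this sketch, `stemLongTerms` leaves "gas" and "news" alone (length 4 or less) and reduces "tokens" to "token". The point is that the length check lives outside the stemmer, so the stemmer itself needs no knowledge of the condition.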

      One potential way to provide this conditional branching facility is a new kind of filter, configured with one or more downstream filters and the condition(s) under which those filters should be engaged. This could be called BranchingFilter.
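One possible shape for such a filter, again as a library-independent sketch rather than a proposal for the actual Lucene API: the `Token` class, the `branch` method and the toy `bigrams` function (a simplified stand-in for ShingleFilter) are all hypothetical names introduced here. The sketch assumes the branch should see maximal runs of adjacent matching tokens, which is what bigram generation needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

final class BranchingFilterSketch {

    // Minimal token model: term text plus a type label (stand-in for TypeAttribute).
    static final class Token {
        final String term;
        final String type;
        Token(String term, String type) { this.term = term; this.type = type; }
    }

    // Toy stand-in for ShingleFilter: emits bigrams of adjacent terms in a run.
    // A single-token run is passed through unchanged.
    static List<Token> bigrams(List<Token> run) {
        if (run.size() < 2) return new ArrayList<>(run);
        List<Token> out = new ArrayList<>();
        for (int i = 0; i + 1 < run.size(); i++) {
            out.add(new Token(run.get(i).term + run.get(i + 1).term, "shingle"));
        }
        return out;
    }

    // Sends each maximal run of tokens matching `condition` through `branchFilter`;
    // all other tokens bypass it. Output follows input order, so both paths
    // end up in a single stream again.
    static List<Token> branch(List<Token> input,
                              Predicate<Token> condition,
                              UnaryOperator<List<Token>> branchFilter) {
        List<Token> out = new ArrayList<>();
        List<Token> run = new ArrayList<>();
        for (Token t : input) {
            if (condition.test(t)) {
                run.add(t);
            } else {
                if (!run.isEmpty()) {
                    out.addAll(branchFilter.apply(run));
                    run = new ArrayList<>();
                }
                out.add(t);
            }
        }
        if (!run.isEmpty()) out.addAll(branchFilter.apply(run));
        return out;
    }

    // Convenience: extracts the term text of each token.
    static List<String> terms(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) out.add(t.term);
        return out;
    }
}
```

Calling `branch(tokens, t -> "<CJK>".equals(t.type), BranchingFilterSketch::bigrams)` on the terms the, x, y, z, ok (with x, y, z typed `<CJK>`) yields the, xy, yz, ok. A real implementation would also have to decide how position increments and offsets are handled across the branch, which this sketch ignores.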

      I think a MergingFilter, the inverse of BranchingFilter, is necessary in the current pipeline architecture, to have a single pipeline endpoint. A MergingFilter might be useful in its own right, e.g. to collect document data from multiple sources. Perhaps a conditional merging facility would be useful as well.
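A merging step could take several forms. The sketch below assumes one of them: each source is already ordered by token position, and the merge interleaves them into one position-ordered stream, breaking ties by source order. The `Token` model, the `position` field and the `merge` method are hypothetical and not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class MergingFilterSketch {

    // Minimal token model: term text plus an absolute position (an assumption).
    static final class Token {
        final String term;
        final int position;
        Token(String term, int position) { this.term = term; this.position = position; }
    }

    // Pairs the current head token of a source with the rest of that source.
    private static final class Head {
        final Token token;
        final Iterator<Token> rest;
        final int sourceIndex;
        Head(Token token, Iterator<Token> rest, int sourceIndex) {
            this.token = token;
            this.rest = rest;
            this.sourceIndex = sourceIndex;
        }
    }

    // K-way merge: repeatedly emits the head with the smallest position.
    // Ties are broken by source index so the result is deterministic.
    static List<Token> merge(List<List<Token>> sources) {
        PriorityQueue<Head> heads = new PriorityQueue<>(
            Comparator.<Head>comparingInt(h -> h.token.position)
                      .thenComparingInt(h -> h.sourceIndex));
        for (int i = 0; i < sources.size(); i++) {
            Iterator<Token> it = sources.get(i).iterator();
            if (it.hasNext()) heads.add(new Head(it.next(), it, i));
        }
        List<Token> out = new ArrayList<>();
        while (!heads.isEmpty()) {
            Head h = heads.poll();
            out.add(h.token);
            if (h.rest.hasNext()) heads.add(new Head(h.rest.next(), h.rest, h.sourceIndex));
        }
        return out;
    }

    // Convenience: extracts the term text of each token.
    static List<String> terms(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) out.add(t.term);
        return out;
    }
}
```

Merging a source holding a(0), c(2) with one holding b(1), d(3) gives a, b, c, d. A conditional variant could accept a predicate per source and drop tokens that fail it before they enter the queue.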

        Activity

        Steve Rowe created issue -

          People

          • Assignee: Unassigned
          • Reporter: Steve Rowe
          • Votes: 0
          • Watchers: 0
