Lucene - Core / LUCENE-233

[PATCH] analyzer refactoring based on CVS HEAD from 6/21/2004

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Component/s: modules/analysis
    • Environment: Operating System: All, Platform: All
    • Bugzilla Id: 29756

    Description

      Hello,

      As mentioned in previous exchanges, notably with Grant Ingersoll, I added some
      new classes to the "analysis" package to meet the requirements of the feature
      request in Bugzilla (http://issues.apache.org/bugzilla/show_bug.cgi?id=28182)
      and did some refactoring while I was under the hood. Here is an overview of
      the class hierarchies after my changes:

      -Analyzer
      --CustomAnalyzer (new abstract class largely based on Grant's BaseAnalyzer)
      --AbstractAnalyzer (new abstract class)
      ---RussianAnalyzer
      ---GermanAnalyzer
      ---etc.

      -Tokenizer
      --CloneableTokenizer (new abstract class)
      ---StandardTokenizer
      ---CharTokenizer
      ---CJKTokenizer
      ---etc.

      -TokenFilter
      --CloneableTokenFilter (new abstract class)
      ---AbstractStemFilter (new abstract class)
      ----RussianStemFilter
      ----GermanStemFilter
      ----etc.

      -Stemmer (very simple new interface used in AbstractStemFilter)
      --PorterStemmer
      --RussianStemmer
      --etc.
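
As a rough illustration of how a small Stemmer interface can be plugged into a shared stem filter base class, here is a minimal sketch. It is not the code from the attached patch: the `stem(String)` signature, the `TrailingSStemmer` toy class, and the list-based `AbstractStemFilterSketch` are all assumptions made for this example, and it deliberately avoids Lucene's TokenStream API so that it stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Stemmer interface; the real signature
// is not shown in this issue.
interface Stemmer {
    String stem(String term);
}

// Toy stemmer for illustration only: strips one trailing "s" from terms
// longer than three characters. This is NOT the Porter algorithm.
class TrailingSStemmer implements Stemmer {
    public String stem(String term) {
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }
}

// Simplified analogue of AbstractStemFilter: it works on a list of strings
// instead of a Lucene TokenStream and delegates each term to a Stemmer.
abstract class AbstractStemFilterSketch {
    private final Stemmer stemmer;

    protected AbstractStemFilterSketch(Stemmer stemmer) {
        this.stemmer = stemmer;
    }

    List<String> filter(List<String> tokens) {
        List<String> stemmed = new ArrayList<>();
        for (String token : tokens) {
            stemmed.add(stemmer.stem(token));
        }
        return stemmed;
    }
}

// A language-specific filter then only has to choose its stemmer.
class ToyStemFilter extends AbstractStemFilterSketch {
    ToyStemFilter() {
        super(new TrailingSStemmer());
    }
}
```

With this shape, a concrete filter such as RussianStemFilter or GermanStemFilter would presumably only need to supply its own Stemmer, while the token-handling loop lives in one place.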

      In the attached zip file there are 3 diff files (core.analysis,
      sandbox.analysis, and sandbox.analysis.snowball) and a zip containing the new
      classes for org.apache.lucene.analysis in the lucene core. I tried to minimize
      the irrelevant code changes (e.g. style, spaces, etc.) in the diffs while
      conforming to the code formatting guidelines outlined by Otis. I think there
      were a number of classes in the "analysis" package that didn't conform so
      these diffs may have a lot of noise, as I reformatted those classes with my
      IDE. Sorry! If the diffs are too painful, let me know and I'll try to prune
      them.

      If there is a TODO list specific to Analyzers, are the items below on that
      list?

      1) move German and Russian packages to sandbox (I think this is on the Lucene
      TODO list)
      2) Analyzer class renaming such that dynamic configuration could return
      classes like Analyzer_ru, Analyzer_de, Analyzer_fr, etc. based on the class
      naming scheme "Analyzer_{Locale.toString}"
      3) Documentation
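
To make item 2 concrete, here is a minimal sketch of how a locale-based class name could be derived and looked up by reflection. The `AnalyzerNames` helper, its method names, and the package argument are hypothetical and are not part of the patch; only `Locale.toString()`, `Class.forName`, and the standard reflection calls are real JDK APIs.

```java
import java.util.Locale;

// Hypothetical helper illustrating the "Analyzer_{Locale.toString}" scheme.
final class AnalyzerNames {
    private AnalyzerNames() {}

    // Builds a fully qualified name such as "<package>.Analyzer_ru".
    static String classNameFor(String packageName, Locale locale) {
        return packageName + ".Analyzer_" + locale.toString();
    }

    // Tries to load and instantiate the class through its no-argument
    // constructor. Returns null when the class is missing or cannot be
    // instantiated.
    static Object instantiate(String packageName, Locale locale) {
        try {
            Class<?> cls = Class.forName(classNameFor(packageName, locale));
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            return null;
        }
    }
}
```

One thing to keep in mind with this scheme is that `Locale.toString()` includes the country when one is set (for example `fr_FR` for `Locale.FRANCE`), so a lookup would probably want to fall back from `Analyzer_fr_FR` to `Analyzer_fr`.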

      Questions, comments, feedback, and criticisms are all welcome.

      Regards,
      RBP

      PS - Thanks Grant!

          People

            Assignee: Unassigned
            Reporter: Rasik Pandey (rasik.pandey@ajlsm.com)