Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1, 2.9, 2.9.1, 3.0
    • Fix Version/s: 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The analyzers in contrib/analyzers should all be marked final. None of the Analyzers should ever be subclassed - users should build their own analyzers if a different combination of filters and Tokenizers is desired.

      1. LUCENE-2100.patch
        23 kB
        Simon Willnauer
      2. LUCENE-2100.patch
        23 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Robert Muir added a comment -

          In my opinion this applies to StandardAnalyzer also (or other non-final analyzers in core)
          Otherwise we can never remove the deprecated setOverridesTokenStreamMethod method.

          Show
          Robert Muir added a comment - In my opinion this applies to StandardAnalyzer also (or other non-final analyzers in core) Otherwise we can never remove the deprecated setOverridesTokenStreamMethod method.
          Hide
          Simon Willnauer added a comment -

          In my opinion this applies to StandardAnalyzer also (or other non-final analyzers in core)

          I would +1 this! Yet is is still a BW-Compat break and I doubt that we can do this in 3.1.

          Show
          Simon Willnauer added a comment - In my opinion this applies to StandardAnalyzer also (or other non-final analyzers in core) I would +1 this! Yet is is still a BW-Compat break and I doubt that we can do this in 3.1.
          Hide
          Robert Muir added a comment -

          Simon what do you suggest? Instead of breaking in 3.1, should we add a warning 'this will become final in X.Y, please do not subclass it, because it is not a good idea' ???

          Uwe did some similar thing with making all the TokenStreams final before but I don't remember what the approach was (maybe just a break)

          Show
          Robert Muir added a comment - Simon what do you suggest? Instead of breaking in 3.1, should we add a warning 'this will become final in X.Y, please do not subclass it, because it is not a good idea' ??? Uwe did some similar thing with making all the TokenStreams final before but I don't remember what the approach was (maybe just a break)
          Hide
          Simon Willnauer added a comment -

          Simon what do you suggest? Instead of breaking in 3.1

          I suggest to move the core analyzer into a separate issue and link those. That way we can make progress here as the bw policy is not that strict or people do not care that much than they do for core analyzers. I doubt that many people have subclassed StandardAnalyzer and if they do they might do something wrong though. Lets have two issues so we can drive the discussion independently from contrib.
          My personal feeling is that we should break it in 3.1 lets see what the other devs object.

          Show
          Simon Willnauer added a comment - Simon what do you suggest? Instead of breaking in 3.1 I suggest to move the core analyzer into a separate issue and link those. That way we can make progress here as the bw policy is not that strict or people do not care that much than they do for core analyzers. I doubt that many people have subclassed StandardAnalyzer and if they do they might do something wrong though. Lets have two issues so we can drive the discussion independently from contrib. My personal feeling is that we should break it in 3.1 lets see what the other devs object.
          Hide
          Simon Willnauer added a comment -

          This patch marks all analyzers in contrib as final and removes the backwards compat tests checking if subclasses implement reusableTokenStream.

          Show
          Simon Willnauer added a comment - This patch marks all analyzers in contrib as final and removes the backwards compat tests checking if subclasses implement reusableTokenStream.
          Hide
          Robert Muir added a comment -

          Hi Simon, this sounds good to me if we clean up contrib first. There are not many analyzers in core anyway (is it just StandardAnalyzer that is not final?)

          My motivation for those was so we could get rid of the deprecated setOverridesTokenStreamMethod method.

          Show
          Robert Muir added a comment - Hi Simon, this sounds good to me if we clean up contrib first. There are not many analyzers in core anyway (is it just StandardAnalyzer that is not final?) My motivation for those was so we could get rid of the deprecated setOverridesTokenStreamMethod method.
          Hide
          Simon Willnauer added a comment -

          There are not many analyzers in core anyway (is it just StandardAnalyzer that is not final?)

          Three of them:

          • StandardAnalyzer
          • KeywordAnalyzer
          • PerFieldAnalyzerWrapper

          My motivation for those was so we could get rid of the deprecated setOverridesTokenStreamMethod method.

          +1 this make me mad each time I look at those analyzers

          Show
          Simon Willnauer added a comment - There are not many analyzers in core anyway (is it just StandardAnalyzer that is not final?) Three of them: StandardAnalyzer KeywordAnalyzer PerFieldAnalyzerWrapper My motivation for those was so we could get rid of the deprecated setOverridesTokenStreamMethod method. +1 this make me mad each time I look at those analyzers
          Hide
          Simon Willnauer added a comment -

          I plan to commit this until 12/09/09 if nobody objects

          Show
          Simon Willnauer added a comment - I plan to commit this until 12/09/09 if nobody objects
          Hide
          Simon Willnauer added a comment -

          will commit this tomorrow if nobody objects

          Show
          Simon Willnauer added a comment - will commit this tomorrow if nobody objects
          Hide
          Robert Muir added a comment -

          +1

          (just my nitpick of s/TokenStreamsm/TokenStreams/ typo in CHANGES...)

          Thanks!

          Show
          Robert Muir added a comment - +1 (just my nitpick of s/TokenStreamsm/TokenStreams/ typo in CHANGES...) Thanks!
          Hide
          Robert Muir added a comment -

          Simon, if you want I will take care of contrib/collation tomorrow. This might make things simpler because I plan on moving it around if no one objects to the patch: LUCENE-2124 has details.

          Show
          Robert Muir added a comment - Simon, if you want I will take care of contrib/collation tomorrow. This might make things simpler because I plan on moving it around if no one objects to the patch: LUCENE-2124 has details.
          Hide
          Simon Willnauer added a comment -

          Updated to latest trunk

          Show
          Simon Willnauer added a comment - Updated to latest trunk
          Hide
          Robert Muir added a comment -

          patch looks good to me!

          Show
          Robert Muir added a comment - patch looks good to me!
          Hide
          Simon Willnauer added a comment -

          committed in revision 888799

          thanks robert for review

          Show
          Simon Willnauer added a comment - committed in revision 888799 thanks robert for review
          Hide
          Esmond Pitt added a comment -

          Did somebody implement this for 3.1.0? StandardAnalyzer became final between 3.0.3 and 3.1.0. This is not acceptable. Binary compatibility must be preserved and to be frank I do not give a good goddam how ugly the code inside looks compared to this requirement.

          Show
          Esmond Pitt added a comment - Did somebody implement this for 3.1.0? StandardAnalyzer became final between 3.0.3 and 3.1.0. This is not acceptable. Binary compatibility must be preserved and to be frank I do not give a good goddam how ugly the code inside looks compared to this requirement.
          Hide
          Steve Rowe added a comment -

          Hi Esmond,

          Take a look at the source code for StandardAnalyzer. Fewer than 50 lines of code there, if you take out the comments. Copy/paste suddenly seems doable. Lucene's Analyzers are best thought of as examples.

          Steve

          Show
          Steve Rowe added a comment - Hi Esmond, Take a look at the source code for StandardAnalyzer . Fewer than 50 lines of code there, if you take out the comments. Copy/paste suddenly seems doable. Lucene's Analyzers are best thought of as examples. Steve
          Hide
          Esmond Pitt added a comment -

          Steve

          Thanks. Maybe you could have a look at this. How do you suggest I recode it?
          I wrote this 7 years ago and cannot now remember anything about it. Quite
          possibly the entire thing is now obsolete, but I've been carting it around
          since before Lucene was even at Apache. All I've ever done is adjust the
          version number.

          ==========================================================
          public class PorterStemAnalyzer extends StandardAnalyzer
          {
          /**

          • Construct a new instance of PorterStemAnalyzer.
            */
            public PorterStemAnalyzer() { super(Version.LUCENE_30); }

          @Override
          public final TokenStream tokenStream(String fieldName, Reader
          reader)

          { return new PorterStemFilter(super.tokenStream(fieldName, reader)); }

          }
          ============================================================

          EJP

          Show
          Esmond Pitt added a comment - Steve Thanks. Maybe you could have a look at this. How do you suggest I recode it? I wrote this 7 years ago and cannot now remember anything about it. Quite possibly the entire thing is now obsolete, but I've been carting it around since before Lucene was even at Apache. All I've ever done is adjust the version number. ========================================================== public class PorterStemAnalyzer extends StandardAnalyzer { /** Construct a new instance of PorterStemAnalyzer. */ public PorterStemAnalyzer() { super(Version.LUCENE_30); } @Override public final TokenStream tokenStream(String fieldName, Reader reader) { return new PorterStemFilter(super.tokenStream(fieldName, reader)); } } ============================================================ EJP
          Hide
          Robert Muir added a comment -

          Esmond: hi, what you are doing here is exactly the reason why we made it final.

          By subclassing StandardAnalyzer in this way, the indexer is no longer able to reuse tokenstreams, making analysis very slow and inefficient.

          The easiest way to get your PorterStemAnalyzer is to just use EnglishAnalyzer, which does just this.

          Otherwise if you really want to do it yourself, do it like this:

          Analyzer analyzer = new ReusableAnalyzerBase() {
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
              Tokenizer tokenizer = new StandardTokenizer(...);
              TokenStream filteredStream = new StandardFilter(tokenizer, ...);
              filteredStream = new LowerCaseFilterFilter(filteredStream, ...);
              filteredStream = new StopFilterFilter(filteredStream, ...);
              filteredStream = new PorterStemFilter(filteredStream, ...);
              return new TokenStreamComponents(tokenizer, filteredStream);
            }
          };
          

          Please see LUCENE-3055 for more examples and a more thorough explanation.

          The good news is if you implement your analyzer like this, you will see performance improvements!

          Show
          Robert Muir added a comment - Esmond: hi, what you are doing here is exactly the reason why we made it final. By subclassing StandardAnalyzer in this way, the indexer is no longer able to reuse tokenstreams, making analysis very slow and inefficient. The easiest way to get your PorterStemAnalyzer is to just use EnglishAnalyzer, which does just this. Otherwise if you really want to do it yourself, do it like this: Analyzer analyzer = new ReusableAnalyzerBase() { protected TokenStreamComponents createComponents(String fieldName, Reader reader) { Tokenizer tokenizer = new StandardTokenizer(...); TokenStream filteredStream = new StandardFilter(tokenizer, ...); filteredStream = new LowerCaseFilterFilter(filteredStream, ...); filteredStream = new StopFilterFilter(filteredStream, ...); filteredStream = new PorterStemFilter(filteredStream, ...); return new TokenStreamComponents(tokenizer, filteredStream); } }; Please see LUCENE-3055 for more examples and a more thorough explanation. The good news is if you implement your analyzer like this, you will see performance improvements!
          Hide
          Esmond Pitt added a comment -

          Many thanks.

          Show
          Esmond Pitt added a comment - Many thanks.
          Hide
          l0co added a comment -

          A little comment from me, because I just need to use this.

          I'm wondering why people in opensource projects so much like to complicate simple things and like to break the rules of OOP, making a lot obstacles for users, like package visibility, almost everything private, and making final classes or methods. This just prevents to quick use the existing code, override class and make anything you want with this the existing code.

          It looks that this is because we should only use existing code, but not to change it any way, right?

          I want to use PolishAnalyzer, which is already implemented, and add a simple improvement to this class, but I cannot do it now. I need to copy all existing class to my package and write there the extension.

          Great "improvement" for me in this ticket.

          Show
          l0co added a comment - A little comment from me, because I just need to use this. I'm wondering why people in opensource projects so much like to complicate simple things and like to break the rules of OOP, making a lot obstacles for users, like package visibility, almost everything private, and making final classes or methods. This just prevents to quick use the existing code, override class and make anything you want with this the existing code. It looks that this is because we should only use existing code, but not to change it any way, right? I want to use PolishAnalyzer, which is already implemented, and add a simple improvement to this class, but I cannot do it now. I need to copy all existing class to my package and write there the extension. Great "improvement" for me in this ticket.

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Simon Willnauer
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development