Lucene - Core
  1. Lucene - Core
  2. LUCENE-5170

Add getter for reuse strategy to Analyzer, make AnalyzerWrapper's reuse strategy configureable, fix strategy to be stateless

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.4
    • Fix Version/s: 4.5, 6.0
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      If you write an Analyzer that wraps another one (but without using AnalyzerWrapper) you may need use the same reuse strategy in your wrapper. This is not possible as there is no way to get the reuse startegy (private field and no getter).

      An example is ES's NamedAnalyzer, see my comment: https://github.com/elasticsearch/elasticsearch/commit/b9a2fbd8741aa1b9beffb7d2922fc9b4525397e4#src/main/java/org/elasticsearch/index/analysis/NamedAnalyzer.java

      This would add a getter, just a 3-liner.

      1. LUCENE-5170.patch
        25 kB
        Uwe Schindler
      2. LUCENE-5170.patch
        25 kB
        Uwe Schindler
      3. LUCENE-5170.patch
        25 kB
        Uwe Schindler
      4. LUCENE-5170.patch
        0.6 kB
        Uwe Schindler

        Activity

        Hide
        Uwe Schindler added a comment -

        Patch.

        Maybe we should rethink AnalyzerWrapper, too. So it would use the strategy of the wrapped Analyzer, too, unless you have something field-specific? In that case your would pass explicit reuse strategy in the ctor, but the default is the one of the inner analyzer.

        Show
        Uwe Schindler added a comment - Patch. Maybe we should rethink AnalyzerWrapper, too. So it would use the strategy of the wrapped Analyzer, too, unless you have something field-specific? In that case your would pass explicit reuse strategy in the ctor, but the default is the one of the inner analyzer.
        Hide
        Simon Willnauer added a comment -

        +1

        Show
        Simon Willnauer added a comment - +1
        Hide
        Robert Muir added a comment -

        +1. I agree we should rethink AnalyzerWrapper too.

        my preference: just make it a mandatory arg to the protected ctor of this class.

        Show
        Robert Muir added a comment - +1. I agree we should rethink AnalyzerWrapper too. my preference: just make it a mandatory arg to the protected ctor of this class.
        Hide
        Uwe Schindler added a comment - - edited

        Robert: After reviewing the code:
        The fixed-nonchangeable "default" in AnalyzerWrapper is PerField, which is a large overhead and should only be used in stuff like PerFieldAnalyzerWrapper (this class should call super(PerField) in its own ctor). But for other use cases of AnalyzerWrapper I have to use global strategy or the one of a wrapped analyzer). It looks like the current impl in AnalyzerWrapper is somehow assuming you want to wrap per field.

        I would suggest to make it mandatory in Lucene trunk, and add the missing ctor in Lucene 4.x, too. The default one should be deprecated with a hint that it might be a bad idea to use this default.

        My use case is:
        I have lots of predefined Analyzers for several languages or functionality in my search application. I have some additional AnalyzerWrappers around that simply turn any other analyzer into a phonetic one or ASCIIFolding one (so I can use that with another field). So, my wrapper just takes one of these per-language Analyzers and wraps with another additional TokenFilter. As the underlying Analyzer is global reuse, I need to make the wrapper global, too - currently impossible. Per field is a waste of resources in this case.

        Only PerFieldAnalyzerWrapper should use PerField strategy hardcoded (as it is per field), the base class not!

        So I would suggest to make the base class AnalyzerWrapper copy the ctor of the superclass Analyzer and deprecate the default ctor in 4.x. For my above example (to wrap another analyzer), I still need the resuse strategy of the inner analyzer, so I need set getter on Analyzer.java, too (see current patch).

        Show
        Uwe Schindler added a comment - - edited Robert: After reviewing the code: The fixed-nonchangeable "default" in AnalyzerWrapper is PerField, which is a large overhead and should only be used in stuff like PerFieldAnalyzerWrapper (this class should call super(PerField) in its own ctor). But for other use cases of AnalyzerWrapper I have to use global strategy or the one of a wrapped analyzer). It looks like the current impl in AnalyzerWrapper is somehow assuming you want to wrap per field. I would suggest to make it mandatory in Lucene trunk, and add the missing ctor in Lucene 4.x, too. The default one should be deprecated with a hint that it might be a bad idea to use this default. My use case is: I have lots of predefined Analyzers for several languages or functionality in my search application. I have some additional AnalyzerWrappers around that simply turn any other analyzer into a phonetic one or ASCIIFolding one (so I can use that with another field). So, my wrapper just takes one of these per-language Analyzers and wraps with another additional TokenFilter. As the underlying Analyzer is global reuse, I need to make the wrapper global, too - currently impossible. Per field is a waste of resources in this case. Only PerFieldAnalyzerWrapper should use PerField strategy hardcoded (as it is per field), the base class not! So I would suggest to make the base class AnalyzerWrapper copy the ctor of the superclass Analyzer and deprecate the default ctor in 4.x. For my above example (to wrap another analyzer), I still need the resuse strategy of the inner analyzer, so I need set getter on Analyzer.java, too (see current patch).
        Hide
        Robert Muir added a comment -

        I would suggest to make it mandatory in Lucene trunk, and add the missing ctor in Lucene 4.x, too. The default one should be deprecated with a hint that it might be a bad idea to use this default.

        Yes, this is exactly what i think we should do. i really should be a mandatory parameter today (but cannot really work without also having the getter available!)

        Show
        Robert Muir added a comment - I would suggest to make it mandatory in Lucene trunk, and add the missing ctor in Lucene 4.x, too. The default one should be deprecated with a hint that it might be a bad idea to use this default. Yes, this is exactly what i think we should do. i really should be a mandatory parameter today (but cannot really work without also having the getter available!)
        Hide
        Uwe Schindler added a comment -

        There is a major problem:
        Strategy is no strategy at all, it holds state!

        So my idea to make the getter available is wrong, because it would make the private "state" of the analyzer public to the outside! So this is a misdesign in the API. The correct way to do this would be:

        Make the strategy a ENUM like class (no state). The ThreadLocal should not be sitting on the strategy, the strategy should only implement the strategy, not also take care of storing the data in the ThreadLocal.

        I have no idea how to fix this - it looks like we need to backwards break to fix this!

        Show
        Uwe Schindler added a comment - There is a major problem: Strategy is no strategy at all, it holds state! So my idea to make the getter available is wrong, because it would make the private "state" of the analyzer public to the outside! So this is a misdesign in the API. The correct way to do this would be: Make the strategy a ENUM like class (no state). The ThreadLocal should not be sitting on the strategy, the strategy should only implement the strategy, not also take care of storing the data in the ThreadLocal. I have no idea how to fix this - it looks like we need to backwards break to fix this!
        Hide
        Uwe Schindler added a comment - - edited

        The definition of the strategy pattern can be found here, no state involved: http://en.wikipedia.org/wiki/Strategy_pattern

        Show
        Uwe Schindler added a comment - - edited The definition of the strategy pattern can be found here, no state involved: http://en.wikipedia.org/wiki/Strategy_pattern
        Hide
        Robert Muir added a comment -

        Make the strategy a ENUM like class (no state). The ThreadLocal should not be sitting on the strategy, the strategy should only implement the strategy, not also take care of storing the data in the ThreadLocal.

        I like this idea, i think it could simplify the thing a lot.

        I have no idea how to fix this - it looks like we need to backwards break to fix this!

        Personally i support that in this case: because i think we can minimize the breaks at the end of the day.

        For example if we switch to enums, in 4.x, we could still allow 'instantiation' but its just useless (since the object is stateless) and deprecated. and the 'constants' would be declared like MultiTermQuery rewrite?

        Show
        Robert Muir added a comment - Make the strategy a ENUM like class (no state). The ThreadLocal should not be sitting on the strategy, the strategy should only implement the strategy, not also take care of storing the data in the ThreadLocal. I like this idea, i think it could simplify the thing a lot. I have no idea how to fix this - it looks like we need to backwards break to fix this! Personally i support that in this case: because i think we can minimize the breaks at the end of the day. For example if we switch to enums, in 4.x, we could still allow 'instantiation' but its just useless (since the object is stateless) and deprecated. and the 'constants' would be declared like MultiTermQuery rewrite?
        Hide
        Uwe Schindler added a comment - - edited

        Attached is my patch which fixes the stateful strategy, which is a serious bug:

        • The strategy is now looking like RewriteMethod of MTQ
        • The strategy is no longer Closeable (because its stateless)
        • The ThreadLocal is now part of Analyzer itsself (private), the reuse strategy has access to it through the protected final getters and setters (taking the Analyzer instance like in RewriteMethod)
        • The ThreadLocal is closed by the analyzer like before, just directly not through the strategy
        • The 2 default strategies are available as static final constants (singletons in trunk): GLOBAL_REUSE_STRATEGY and PER_FIELD_REUSE_STRATEGY

        To preserve backwards:

        • The public constructors of the implementation classes were kept alive, so code using the ReuseStrategy in the old-fashined stateful way can be kept unchanged. It only reports a deprecation warning to use the constants instead.
        • We only have a backwards break for users that implement their own strategy, which is unlikely, because what else can you do in a strategy? I have no idea, maybe reuse components by checksum of field name? A fieldname prefix (this looks like the only sensible use-case)? Maybe have a lifetime (reuse only for a limited time because synonyms update)?

        All tests pass.

        Show
        Uwe Schindler added a comment - - edited Attached is my patch which fixes the stateful strategy, which is a serious bug: The strategy is now looking like RewriteMethod of MTQ The strategy is no longer Closeable (because its stateless) The ThreadLocal is now part of Analyzer itsself (private), the reuse strategy has access to it through the protected final getters and setters (taking the Analyzer instance like in RewriteMethod ) The ThreadLocal is closed by the analyzer like before, just directly not through the strategy The 2 default strategies are available as static final constants (singletons in trunk): GLOBAL_REUSE_STRATEGY and PER_FIELD_REUSE_STRATEGY To preserve backwards: The public constructors of the implementation classes were kept alive, so code using the ReuseStrategy in the old-fashined stateful way can be kept unchanged. It only reports a deprecation warning to use the constants instead. We only have a backwards break for users that implement their own strategy, which is unlikely, because what else can you do in a strategy? I have no idea, maybe reuse components by checksum of field name? A fieldname prefix (this looks like the only sensible use-case)? Maybe have a lifetime (reuse only for a limited time because synonyms update)? All tests pass.
        Hide
        Uwe Schindler added a comment -

        Javadocs fixes. I think, it is ready to commit, I just want to have some opinions! Robert?

        Show
        Uwe Schindler added a comment - Javadocs fixes. I think, it is ready to commit, I just want to have some opinions! Robert?
        Hide
        Uwe Schindler added a comment -

        This patch removes the crazy NPE handling which makes no sense as it is not performance critical.

        Show
        Uwe Schindler added a comment - This patch removes the crazy NPE handling which makes no sense as it is not performance critical.
        Hide
        Robert Muir added a comment -

        +1

        Thanks for cleaning this up!

        Show
        Robert Muir added a comment - +1 Thanks for cleaning this up!
        Hide
        ASF subversion and git services added a comment -

        Commit 1513903 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1513903 ]

        LUCENE-5170: Add getter for reuse strategy to Analyzer, make AnalyzerWrapper's reuse strategy configurable, fix strategy to be stateless

        Show
        ASF subversion and git services added a comment - Commit 1513903 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1513903 ] LUCENE-5170 : Add getter for reuse strategy to Analyzer, make AnalyzerWrapper's reuse strategy configurable, fix strategy to be stateless
        Hide
        ASF subversion and git services added a comment -

        Commit 1513908 from Uwe Schindler in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1513908 ]

        Merged revision(s) 1513903 from lucene/dev/trunk:
        LUCENE-5170: Add getter for reuse strategy to Analyzer, make AnalyzerWrapper's reuse strategy configurable, fix strategy to be stateless

        Show
        ASF subversion and git services added a comment - Commit 1513908 from Uwe Schindler in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1513908 ] Merged revision(s) 1513903 from lucene/dev/trunk: LUCENE-5170 : Add getter for reuse strategy to Analyzer, make AnalyzerWrapper's reuse strategy configurable, fix strategy to be stateless
        Hide
        ASF subversion and git services added a comment -

        Commit 1513914 from Uwe Schindler in branch 'dev/trunk'
        [ https://svn.apache.org/r1513914 ]

        LUCENE-5170: Remove deprecated code

        Show
        ASF subversion and git services added a comment - Commit 1513914 from Uwe Schindler in branch 'dev/trunk' [ https://svn.apache.org/r1513914 ] LUCENE-5170 : Remove deprecated code
        Hide
        Adrien Grand added a comment -

        4.5 release -> bulk close

        Show
        Adrien Grand added a comment - 4.5 release -> bulk close

          People

          • Assignee:
            Uwe Schindler
            Reporter:
            Uwe Schindler
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development