Lucene - Core
  1. Lucene - Core
  2. LUCENE-2051

Contrib Analyzer Setters should be deprecated and replace with ctor arguments

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 2.9.1
    • Fix Version/s: 3.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Some analyzers in contrib provide setters for stopword / stem exclusion sets / hashtables etc. Those setters should be deprecated as they yield unexpected behaviour. The way they work is they set the reusable token stream instance to null in a thread local cache which only affects the tokenstream in the current thread. Analyzers itself should be immutable except of the threadlocal.

      will attach a patch soon.

      1. LUCENE-2051.patch
        44 kB
        Simon Willnauer
      2. LUCENE-2051.patch
        36 kB
        Simon Willnauer

        Issue Links

          Activity

          Hide
          Simon Willnauer added a comment -

          thanks uwe!

          Show
          Simon Willnauer added a comment - thanks uwe!
          Hide
          Uwe Schindler added a comment -

          I committed it for you in revision: 880715

          Simon, you can close it, as you are assigned.

          Show
          Uwe Schindler added a comment - I committed it for you in revision: 880715 Simon, you can close it, as you are assigned.
          Hide
          Robert Muir added a comment -

          sure, they solve two different things.
          static Set<?> getDefaultStopSet() -> will always return the default stopword set. (replacement for the String array
          getStopwords() -> returns the currently used stopwords by the analyzer.

          that is all. If we can get rid of that completely I'm fine with it. I would actually like to not expose the stopwords. We could put them all in files in 3.1 and load them from there?!

          yeah i too prefer they be hidden behind a method, (we should store them however we damn well feel, i do like files though)
          just wanted to make sure it would not somehow present a problem for StopAwareAnalyzer in the future, if it can support getting this set then we are ok.

          Show
          Robert Muir added a comment - sure, they solve two different things. static Set<?> getDefaultStopSet() -> will always return the default stopword set. (replacement for the String array getStopwords() -> returns the currently used stopwords by the analyzer. that is all. If we can get rid of that completely I'm fine with it. I would actually like to not expose the stopwords. We could put them all in files in 3.1 and load them from there?! yeah i too prefer they be hidden behind a method, (we should store them however we damn well feel, i do like files though) just wanted to make sure it would not somehow present a problem for StopAwareAnalyzer in the future, if it can support getting this set then we are ok.
          Hide
          Simon Willnauer added a comment -

          right, but still, will they this static method be supported after refactoring to StopAwareAnalyzer?

          sure, they solve two different things.
          static Set<?> getDefaultStopSet() -> will always return the default stopword set. (replacement for the String array
          getStopwords() -> returns the currently used stopwords by the analyzer.

          that is all. If we can get rid of that completely I'm fine with it. I would actually like to not expose the stopwords. We could put them all in files in 3.1 and load them from there?!

          Show
          Simon Willnauer added a comment - right, but still, will they this static method be supported after refactoring to StopAwareAnalyzer? sure, they solve two different things. static Set<?> getDefaultStopSet() -> will always return the default stopword set. (replacement for the String array getStopwords() -> returns the currently used stopwords by the analyzer. that is all. If we can get rid of that completely I'm fine with it. I would actually like to not expose the stopwords. We could put them all in files in 3.1 and load them from there?!
          Hide
          Robert Muir added a comment -

          This is different. the StopawareAnalyzer#getStopwords() is an instance method to get the "current" stopword set of the instance. while the ones I introduced here are static to get the default set instead. We need to provide a replacement for the public static final Sting[] stuff for deprecation an I thing they have to be there. thoughts?

          right, but still, will they this static method be supported after refactoring to StopAwareAnalyzer?

          Show
          Robert Muir added a comment - This is different. the StopawareAnalyzer#getStopwords() is an instance method to get the "current" stopword set of the instance. while the ones I introduced here are static to get the default set instead. We need to provide a replacement for the public static final Sting[] stuff for deprecation an I thing they have to be there. thoughts? right, but still, will they this static method be supported after refactoring to StopAwareAnalyzer?
          Hide
          Simon Willnauer added a comment -

          Attached complete patch.
          Thanks for the pointer with the bogus import.

          Show
          Simon Willnauer added a comment - Attached complete patch. Thanks for the pointer with the bogus import.
          Hide
          Simon Willnauer added a comment -

          should we expose the getDefaultStopSet() as public yet,

          This is different. the StopawareAnalyzer#getStopwords() is an instance method to get the "current" stopword set of the instance. while the ones I introduced here are static to get the default set instead. We need to provide a replacement for the public static final Sting[] stuff for deprecation an I thing they have to be there. thoughts?

          also, I'm not sure i like the copy() method in CharArraySet, i think it should return a real copy even if it is an EMPTY_SET, and if you give it a CharArraySet it should call .clone() ?

          the deal with this copy method is that StopFilter converts the incoming set to a chararrayset if its not a such already. I want to have all sets in analyzers to be unmodifiable and an instance of ChararraySet. Further they shoud be a real copy as otherwise they could be modified by the caller of the Analyzer ctor. Thats why I introduced this helper as such code was duplicated all over the place.

          nothing to do with your issue, but maybe while we are here cleaning up these ctors we should fix the fact that a lot of these never call super() ?

          Java guarantees that the default super ctor is called implicitly. I would not add all this noise (calling it explicitly) just for the sake of typing super() 20 times.

          Thoughts?

          Show
          Simon Willnauer added a comment - should we expose the getDefaultStopSet() as public yet, This is different. the StopawareAnalyzer#getStopwords() is an instance method to get the "current" stopword set of the instance. while the ones I introduced here are static to get the default set instead. We need to provide a replacement for the public static final Sting[] stuff for deprecation an I thing they have to be there. thoughts? also, I'm not sure i like the copy() method in CharArraySet, i think it should return a real copy even if it is an EMPTY_SET, and if you give it a CharArraySet it should call .clone() ? the deal with this copy method is that StopFilter converts the incoming set to a chararrayset if its not a such already. I want to have all sets in analyzers to be unmodifiable and an instance of ChararraySet. Further they shoud be a real copy as otherwise they could be modified by the caller of the Analyzer ctor. Thats why I introduced this helper as such code was duplicated all over the place. nothing to do with your issue, but maybe while we are here cleaning up these ctors we should fix the fact that a lot of these never call super() ? Java guarantees that the default super ctor is called implicitly. I would not add all this noise (calling it explicitly) just for the sake of typing super() 20 times. Thoughts?
          Hide
          Robert Muir added a comment -

          simon, should we expose the getDefaultStopSet() as public yet, if you are planning on refactoring this stopword stuff in 3.1 anyway? (would you have to then deprecate this in 3.1?)

          also, I'm not sure i like the copy() method in CharArraySet, i think it should return a real copy even if it is an EMPTY_SET, and if you give it a CharArraySet it should call .clone() ?

          other things are minor, in Czech i think there is a spurious import added (javax.print.DocFlavor.CHAR_ARRAY), etc.

          nothing to do with your issue, but maybe while we are here cleaning up these ctors we should fix the fact that a lot of these never call super() ?

          Show
          Robert Muir added a comment - simon, should we expose the getDefaultStopSet() as public yet, if you are planning on refactoring this stopword stuff in 3.1 anyway? (would you have to then deprecate this in 3.1?) also, I'm not sure i like the copy() method in CharArraySet, i think it should return a real copy even if it is an EMPTY_SET, and if you give it a CharArraySet it should call .clone() ? other things are minor, in Czech i think there is a spurious import added (javax.print.DocFlavor.CHAR_ARRAY), etc. nothing to do with your issue, but maybe while we are here cleaning up these ctors we should fix the fact that a lot of these never call super() ?
          Hide
          Simon Willnauer added a comment -

          Attached first patch, CJK and Arabic Analyzer still missing.

          will do ASAP.
          @Robert: please have a quick look. Thx

          Show
          Simon Willnauer added a comment - Attached first patch, CJK and Arabic Analyzer still missing. will do ASAP. @Robert: please have a quick look. Thx
          Hide
          Uwe Schindler added a comment -

          I will create an RC, as soon as we have the remaining issues closed (hopefully soon), ideally at the beginning of the week! Some remaining issues (like changing BW requirements) are not really release critical. Also the change of sysreq page.

          I think only your's, Mike's with default readonly ctors and maybe the N-Gram one are important.

          Show
          Uwe Schindler added a comment - I will create an RC, as soon as we have the remaining issues closed (hopefully soon), ideally at the beginning of the week! Some remaining issues (like changing BW requirements) are not really release critical. Also the change of sysreq page. I think only your's, Mike's with default readonly ctors and maybe the N-Gram one are important.
          Hide
          Robert Muir added a comment -

          Simon, if you are busy, you can assign to me, or we can split the work. let me know

          Show
          Robert Muir added a comment - Simon, if you are busy, you can assign to me, or we can split the work. let me know
          Hide
          Simon Willnauer added a comment -

          Yes I do, this one got a little lost in my schedule... I will try to get the patch done by tomorrow night, would that be ok for 3.0? What's your schedule for the release?

          Show
          Simon Willnauer added a comment - Yes I do, this one got a little lost in my schedule... I will try to get the patch done by tomorrow night, would that be ok for 3.0? What's your schedule for the release?
          Hide
          Uwe Schindler added a comment -

          Simon, are you still working on a patch?

          Show
          Uwe Schindler added a comment - Simon, are you still working on a patch?
          Hide
          Simon Willnauer added a comment -

          This would also include deprecating some of the constructors. I will attach a patch which adds / deprecates ctors for all analyzers having those setters too.

          simon

          Show
          Simon Willnauer added a comment - This would also include deprecating some of the constructors. I will attach a patch which adds / deprecates ctors for all analyzers having those setters too. simon
          Hide
          Robert Muir added a comment -

          I agree. Any one analyzer should really just serve as an example of how to put tokenstreams together.
          They shouldn't try to meet all users needs, but instead be very simple and easy for the user to customize.

          This complexity caused by setters was painful when implementing reusableTokenStream, these setters require special handling and code complexity.
          and there might even still be some bug I introduced in this process, we try our best but these setters make life very complex.

          I would like to see these setters deprecated for 3.0 so that code will be simpler in the future.

          Show
          Robert Muir added a comment - I agree. Any one analyzer should really just serve as an example of how to put tokenstreams together. They shouldn't try to meet all users needs, but instead be very simple and easy for the user to customize. This complexity caused by setters was painful when implementing reusableTokenStream, these setters require special handling and code complexity. and there might even still be some bug I introduced in this process, we try our best but these setters make life very complex. I would like to see these setters deprecated for 3.0 so that code will be simpler in the future.

            People

            • Assignee:
              Simon Willnauer
              Reporter:
              Simon Willnauer
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development