Lucene - Core
  1. Lucene - Core
  2. LUCENE-1636

TokenFilters with a null value in the constructor fail

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      While migrating from 2.4.x to 2.9-dev I found a lot of failing unittests.
      One problem is with TokenFilters that do a super(null) in the constructor.
      I fixed it by changing the constructor to super(new EmptyTokenStream())
      This will cause problems and frustration to others while migrating to 2.9.

      1. LUCENE-1636.patch
        0.6 kB
        Uwe Schindler

        Activity

        Hide
        Michael McCandless added a comment -

        Hmm... that's no good.

        Can you give some more specifics here? Which TokenFilter(s) are doing super(null)?

        What other unit tests are failing?

        Show
        Michael McCandless added a comment - Hmm... that's no good. Can you give some more specifics here? Which TokenFilter(s) are doing super(null)? What other unit tests are failing?
        Hide
        Wouter Heijke added a comment -

        My own filters and my own tests.
        It is just that the behaviour of Lucene has changed in such cases.

        Show
        Wouter Heijke added a comment - My own filters and my own tests. It is just that the behaviour of Lucene has changed in such cases.
        Hide
        Uwe Schindler added a comment -

        Hi Wouter,
        I tend to say, this is Won't fix:

        • You use TokenFilter in an undocumented way. I suspect, that you set the delegate stream later, correct? To prevent this, the protected "input" member should normally be "final", as this is not possible (but worked with old analyzers) - ok, this was an error, to not make the delegate strean final (but we shoudl fix this now)
        • The main problem of changing the non-final delegate stream later lies in the fact, that with 2.9 a new TokenStream API (setUseNewAPI()) was added. This new API does not use the Token class anymore, but uses Attributes. The stream's attributes list must be reused in the TokenFilters. If you initialize a TokenFilter with a specific delegate, the attribute instances are copied. If you change the delegate stream later, the filter still uses the attributes of the original stream and will update the wrong ones, it will not work anymore.

        So we cannot support this wrong behaviour of TokenFilter anymore. But we should fix the final accessor of "input", I will attach a patch.

        Show
        Uwe Schindler added a comment - Hi Wouter, I tend to say, this is Won't fix: You use TokenFilter in an undocumented way. I suspect, that you set the delegate stream later, correct? To prevent this, the protected "input" member should normally be "final", as this is not possible (but worked with old analyzers) - ok, this was an error, to not make the delegate strean final (but we shoudl fix this now ) The main problem of changing the non-final delegate stream later lies in the fact, that with 2.9 a new TokenStream API (setUseNewAPI()) was added. This new API does not use the Token class anymore, but uses Attributes. The stream's attributes list must be reused in the TokenFilters. If you initialize a TokenFilter with a specific delegate, the attribute instances are copied. If you change the delegate stream later, the filter still uses the attributes of the original stream and will update the wrong ones, it will not work anymore. So we cannot support this wrong behaviour of TokenFilter anymore. But we should fix the final accessor of "input", I will attach a patch.
        Hide
        Uwe Schindler added a comment -

        This patch prevents changing the delegate stream.

        Show
        Uwe Schindler added a comment - This patch prevents changing the delegate stream.
        Hide
        Wouter Heijke added a comment -

        Here's an example
        before:
        private MyFilter myFilter = new MyFilter(null);

        after:
        private MyFilter myFilter = new MyFilter(new EmptyTokenStream());

        Show
        Wouter Heijke added a comment - Here's an example before: private MyFilter myFilter = new MyFilter(null); after: private MyFilter myFilter = new MyFilter(new EmptyTokenStream());
        Hide
        Wouter Heijke added a comment -

        Whatever cause it was, it was behaving this way for a long time, changing it to the new way in 3.0 seems more acceptable to me then in 2.9 to me.

        Show
        Wouter Heijke added a comment - Whatever cause it was, it was behaving this way for a long time, changing it to the new way in 3.0 seems more acceptable to me then in 2.9 to me.
        Hide
        Uwe Schindler added a comment -

        If you do not use a delegate TokenStream, extend TokenStream instead of TokenFilter. If you want to later set the delegate, see my comment (and patch to prevent this incorrect usage) above.

        Show
        Uwe Schindler added a comment - If you do not use a delegate TokenStream, extend TokenStream instead of TokenFilter. If you want to later set the delegate, see my comment (and patch to prevent this incorrect usage) above.
        Hide
        Uwe Schindler added a comment -

        What do you want to do with an filter without a delegate? Calling next() would simply NPE, too!

        Show
        Uwe Schindler added a comment - What do you want to do with an filter without a delegate? Calling next() would simply NPE, too!
        Hide
        Michael McCandless added a comment -

        I think we should change this in 2.9, for the reasons Uwe pointed out, to disallow changing the delegate after construction.

        Show
        Michael McCandless added a comment - I think we should change this in 2.9, for the reasons Uwe pointed out, to disallow changing the delegate after construction.
        Hide
        Wouter Heijke added a comment -

        I only hope users will understand this and they realize that 2.9 is not backwards compatible to previous versions! This code in our codebase was added by knowledgeable Lucene developers!

        Show
        Wouter Heijke added a comment - I only hope users will understand this and they realize that 2.9 is not backwards compatible to previous versions! This code in our codebase was added by knowledgeable Lucene developers!
        Hide
        Uwe Schindler added a comment -

        Hi Wouter,
        I still want to find out, what you are trying to do with a TokenFilter without a delegate! Can you explain, why you want to initialize with super(null)?
        If it is because you want to change it later to something non-null, it will not work anymore (this is why I want to make the delgate stream final). So please explain!

        Show
        Uwe Schindler added a comment - Hi Wouter, I still want to find out, what you are trying to do with a TokenFilter without a delegate! Can you explain, why you want to initialize with super(null)? If it is because you want to change it later to something non-null, it will not work anymore (this is why I want to make the delgate stream final). So please explain!
        Hide
        Uwe Schindler added a comment - - edited

        Mike:
        Would this affect backwards compatibility? If we make it final and nobody changes the stream, everything is ok. Is this also the case, when using lucene.jar as a dropin-replacement without recompilation? Will changing a final variable from code, compiled before finalization, be detected by the JVM? Is the compiled code with final still binary compatible to code compiled againt non-final members?
        I think, we should try this out before committing!

        Show
        Uwe Schindler added a comment - - edited Mike: Would this affect backwards compatibility? If we make it final and nobody changes the stream, everything is ok. Is this also the case, when using lucene.jar as a dropin-replacement without recompilation? Will changing a final variable from code, compiled before finalization, be detected by the JVM? Is the compiled code with final still binary compatible to code compiled againt non-final members? I think, we should try this out before committing!
        Hide
        Wouter Heijke added a comment -

        I'm on holiday now, but as far as I recollect (as I was not the author of the code) it was done on some filters that would be used in another situation (similar to a filter) to use the filter's functionality. Also it was used with filters that could not be extended, so a new filter was created, also here the orignal filter's public methods would be called.

        In a way it doesn't matter, it could be done with the api without any problems with the latest few releases that i know of.

        A more elegant way if one would like to introduce this new behaviour is to at least log some kind of error message in the 2.9 release so users would be alarmed that they use the Lucene api in a way that is not supported anymore.

        Show
        Wouter Heijke added a comment - I'm on holiday now, but as far as I recollect (as I was not the author of the code) it was done on some filters that would be used in another situation (similar to a filter) to use the filter's functionality. Also it was used with filters that could not be extended, so a new filter was created, also here the orignal filter's public methods would be called. In a way it doesn't matter, it could be done with the api without any problems with the latest few releases that i know of. A more elegant way if one would like to introduce this new behaviour is to at least log some kind of error message in the 2.9 release so users would be alarmed that they use the Lucene api in a way that is not supported anymore.
        Hide
        Michael McCandless added a comment -

        Good questions Uwe!

        I tested the back-compat by adding this to TestAnalyzers temporarily
        in my local checkout:

        private class TestFilter extends TokenFilter {
          public TestFilter() {
            super(new WhitespaceTokenizer(null));
          }
        }
        
        public void testChangeTokenFilterInput() {
          TokenFilter tf = new TestFilter();
          System.out.println("tf.input = " + tf.input);
          tf.input = null;
        }
        

        Then, ant test-tag -Dtestcase=TestAnalyzers results in:

        [junit] ------------- Standard Output ---------------
        [junit] tf.input = (start=0,end=0,term=)
        [junit] ------------- ---------------- ---------------
        [junit] Testcase: testChangeTokenFilterInput(org.apache.lucene.analysis.TestAnalyzers):	Caused an ERROR
        [junit] null
        [junit] java.lang.IllegalAccessError
        [junit] 	at org.apache.lucene.analysis.TestAnalyzers.testChangeTokenFilterInput(TestAnalyzers.java:143)
        

        So, 1) it doesn't break JAR drop-in-abilility when one references
        input, and 2) indeed at runtime final-ness is enforced by the JRE. So
        I think we should proceed with the change? It is a back-compat break
        for those users who change input after creating a TokenFilter, but
        such a use case was not legal usage of the API, and will specifically
        not work like it did in the past (so back-compat was already broken,
        just in a much more sneaky manner).

        Show
        Michael McCandless added a comment - Good questions Uwe! I tested the back-compat by adding this to TestAnalyzers temporarily in my local checkout: private class TestFilter extends TokenFilter { public TestFilter() { super ( new WhitespaceTokenizer( null )); } } public void testChangeTokenFilterInput() { TokenFilter tf = new TestFilter(); System .out.println( "tf.input = " + tf.input); tf.input = null ; } Then, ant test-tag -Dtestcase=TestAnalyzers results in: [junit] ------------- Standard Output --------------- [junit] tf.input = (start=0,end=0,term=) [junit] ------------- ---------------- --------------- [junit] Testcase: testChangeTokenFilterInput(org.apache.lucene.analysis.TestAnalyzers): Caused an ERROR [junit] null [junit] java.lang.IllegalAccessError [junit] at org.apache.lucene.analysis.TestAnalyzers.testChangeTokenFilterInput(TestAnalyzers.java:143) So, 1) it doesn't break JAR drop-in-abilility when one references input, and 2) indeed at runtime final-ness is enforced by the JRE. So I think we should proceed with the change? It is a back-compat break for those users who change input after creating a TokenFilter, but such a use case was not legal usage of the API, and will specifically not work like it did in the past (so back-compat was already broken, just in a much more sneaky manner).
        Hide
        Uwe Schindler added a comment -

        Thanks, I am still in Japan and had no time to check this out myself. So it seems good. I would suggest to add this to changes.txt with a warning about the change in backwards compatibility.
        The issue itsself should be closed as wont-fix, as the usage of Filters with a null delegate is not specified in API and is not correct.
        The case, Wouter explained, is just a workaround for incorrect usage of filters. He wants to reuse some methods of the filter somewhere else and wants to construct a "dummy" instance for that. He should not do this and better move the code accessible outside of the TokenFilter to a separate class.
        Maybe we should add a special ==null test to the constructor, that early throws a NPE with a short notice, that this usage is no longer supported.

        Show
        Uwe Schindler added a comment - Thanks, I am still in Japan and had no time to check this out myself. So it seems good. I would suggest to add this to changes.txt with a warning about the change in backwards compatibility. The issue itsself should be closed as wont-fix, as the usage of Filters with a null delegate is not specified in API and is not correct. The case, Wouter explained, is just a workaround for incorrect usage of filters. He wants to reuse some methods of the filter somewhere else and wants to construct a "dummy" instance for that. He should not do this and better move the code accessible outside of the TokenFilter to a separate class. Maybe we should add a special ==null test to the constructor, that early throws a NPE with a short notice, that this usage is no longer supported.
        Hide
        Uwe Schindler added a comment -

        Oh, you already committed this

        Show
        Uwe Schindler added a comment - Oh, you already committed this
        Hide
        Michael McCandless added a comment -

        OK I'll add a null check in AttributeSource's ctor.

        Show
        Michael McCandless added a comment - OK I'll add a null check in AttributeSource's ctor.
        Hide
        Wouter Heijke added a comment -

        A good example of this 'abuse' is in ShingleMatrixFilter.java in one of it's constructors.
        Here also the input is not used and replaced by an EmptyTokenStream:

        // set the input to be an empty token stream, we already have the data.
        this.input = new EmptyTokenStream();

        Show
        Wouter Heijke added a comment - A good example of this 'abuse' is in ShingleMatrixFilter.java in one of it's constructors. Here also the input is not used and replaced by an EmptyTokenStream: // set the input to be an empty token stream, we already have the data. this.input = new EmptyTokenStream();
        Hide
        Uwe Schindler added a comment -

        A good example of this 'abuse' is in ShingleMatrixFilter.java in one of it's constructors.

        This class extends TokenStream not TokenFilter! As the input instance member of TokenFilter is now final, it would not even compile. Please note, this is one of the contrib packages, not yet using the new API, so with useNewApi set to true, this TokenStream would fail (see LUCENE-1460). The change, you have the problem with, is caused by the new TokenStream API, and so NULL delegates are not possible!

        Show
        Uwe Schindler added a comment - A good example of this 'abuse' is in ShingleMatrixFilter.java in one of it's constructors. This class extends TokenStream not TokenFilter! As the input instance member of TokenFilter is now final, it would not even compile. Please note, this is one of the contrib packages, not yet using the new API, so with useNewApi set to true, this TokenStream would fail (see LUCENE-1460 ). The change, you have the problem with, is caused by the new TokenStream API, and so NULL delegates are not possible!
        Hide
        Mark Bennett added a comment -

        (trouble posting, forgive if duplicate)
        This change also broke the Japanese morphological SEN / Lucene integration code in lucene-ja. Since Solr 1.4 is based on Lucene 2.9, this will also effectively break SEN for Solr users who upgrade to 1.4.

        I'm not complaining. Reading the above comments, the change was probably the "right" thing to do. I've contacted the author of lucene-ja, and I hope to work on a rewrite to address this.

        I would be interested in any comments you folks might have about the lucene-ja code.

        Class org.apache.lucene.analysis.ja.POSFilter
        Extends org.apache.lucene.analysis.TokenFilter

        Offending code in lucene-ja's POSFilter
        public POSFilter(TokenStream in, Hashtable posTable)

        { super(in); input = in; // <<==== this is a member field of parent TokenFilter table = posTable; }

        This is done in several classes.

        Show
        Mark Bennett added a comment - (trouble posting, forgive if duplicate) This change also broke the Japanese morphological SEN / Lucene integration code in lucene-ja. Since Solr 1.4 is based on Lucene 2.9, this will also effectively break SEN for Solr users who upgrade to 1.4. I'm not complaining. Reading the above comments, the change was probably the "right" thing to do. I've contacted the author of lucene-ja, and I hope to work on a rewrite to address this. I would be interested in any comments you folks might have about the lucene-ja code. Class org.apache.lucene.analysis.ja.POSFilter Extends org.apache.lucene.analysis.TokenFilter Offending code in lucene-ja's POSFilter public POSFilter(TokenStream in, Hashtable posTable) { super(in); input = in; // <<==== this is a member field of parent TokenFilter table = posTable; } This is done in several classes.
        Hide
        Uwe Schindler added a comment -

        This change is also noted in the backwards compatibility section of Lucene 2.9.

        The assignment of filter in the ctor is totally useless, as the super ctor does it already, so it the problem of this third party software that used the API in an undocumented way. I am sorry for your problems, but the author of lucene-ja should provide a fix, if you have the source code available, it is a less important problem, if it is closed source, you have to ask the author to fix it soon.

        Show
        Uwe Schindler added a comment - This change is also noted in the backwards compatibility section of Lucene 2.9. The assignment of filter in the ctor is totally useless, as the super ctor does it already, so it the problem of this third party software that used the API in an undocumented way. I am sorry for your problems, but the author of lucene-ja should provide a fix, if you have the source code available, it is a less important problem, if it is closed source, you have to ask the author to fix it soon.
        Hide
        Mark Bennett added a comment -

        Uwe,

        Your comments were a home run, thanks!

        There are 6 places in the code where constructors needlessly try to assign that. Commenting them all out fixes that problem.

        There's some other stuff to clean up and I'm chatting with the author.

        Thanks again,
        Mark

        Show
        Mark Bennett added a comment - Uwe, Your comments were a home run, thanks! There are 6 places in the code where constructors needlessly try to assign that. Commenting them all out fixes that problem. There's some other stuff to clean up and I'm chatting with the author. Thanks again, Mark

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Wouter Heijke
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development