Lucene - Core
  1. Lucene - Core
  2. LUCENE-6121

Fix CachingTokenFilter to propagate reset() the first time

    Details

    • Type: Improvement Improvement
    • Status: Reopened
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      CachingTokenFilter should have been propagating reset() but only the first time and thus you would then use CachingTokenFilter in a more normal way – wrap it and call reset() then increment in a loop, etc., instead of knowing you need to reset() on what it wraps but not this token filter itself. That's weird. It's ab-normal for a TokenFilter to never propagate reset, so every user of CachingTokenFilter to date has worked around this by calling reset() on the underlying input instead of the final wrapping token filter (CachingTokenFilter in this case).

        Issue Links

          Activity

          Hide
          David Smiley added a comment -

          The main challenge with doing exactly what the subject calls for, propagating reset(), is backwards compatibility. Any user of this would have to know, starting with Lucene 5, that they should call reset() on this filter not what it contains. That's easy to overlook and will compile and could lead to some head-scratching periods when an exception first triggers.

          Robert Muir thought maybe a new class is a solution. I think that'd be a shame when the name of this one is perfectly good, and already known to people. I propose instead that the existing constructor be made private, that a new protected one appear (for subclassing) with a boolean resetInput, and a public static wrap(input) method appear to be used instead of the old constructor. Someone upgrading to Lucene 5 will see a compile error forcing them to look at the new API, and we'll spell out that they should now call reset() on this filter, not the input.

          Show
          David Smiley added a comment - The main challenge with doing exactly what the subject calls for, propagating reset(), is backwards compatibility. Any user of this would have to know, starting with Lucene 5, that they should call reset() on this filter not what it contains. That's easy to overlook and will compile and could lead to some head-scratching periods when an exception first triggers. Robert Muir thought maybe a new class is a solution. I think that'd be a shame when the name of this one is perfectly good, and already known to people. I propose instead that the existing constructor be made private, that a new protected one appear (for subclassing) with a boolean resetInput, and a public static wrap(input) method appear to be used instead of the old constructor. Someone upgrading to Lucene 5 will see a compile error forcing them to look at the new API, and we'll spell out that they should now call reset() on this filter, not the input.
          Hide
          Robert Muir added a comment -

          As i said on the previous issue, I am -1 to this boolean.

          Just fix the behavior of the class. Its a major release, we can break anything.

          Show
          Robert Muir added a comment - As i said on the previous issue, I am -1 to this boolean. Just fix the behavior of the class. Its a major release, we can break anything.
          Hide
          David Smiley added a comment -

          Ok, nevermind the protected constructor with the boolean. How do you feel about wrap(input)? I like it. This ensures people get a compile error triggering them to look at the API. The existing constructor could be protected.

          Show
          David Smiley added a comment - Ok, nevermind the protected constructor with the boolean. How do you feel about wrap(input)? I like it. This ensures people get a compile error triggering them to look at the API. The existing constructor could be protected.
          Hide
          Robert Muir added a comment -

          If this one has the name you want, then we just change the behavior and thats it. Please, no booleans, factories, builders, or other unnecessary abstractions. Static factory methods are terrible. You cannot tell what they are doing... we should always avoid this kind of garbage.

          Show
          Robert Muir added a comment - If this one has the name you want, then we just change the behavior and thats it. Please, no booleans, factories, builders, or other unnecessary abstractions. Static factory methods are terrible. You cannot tell what they are doing... we should always avoid this kind of garbage.
          Hide
          David Smiley added a comment -

          The attached patch propagates reset() on reset() only when the stream hasn't been cached yet. This is good/intuitive behavior, simple, and actually isn't as much of a change to existing users; many might not even notice. If you still call reset on the input (but not in CachingTokenFilter) before then calling incrementToken, it'll all work out fine. You shouldn't to follow the spirit of our API, but you can. In fact there's one query builder ('flexible' AnalyzerQueryNodeProcessor) that I didn't change to move what it calls reset() on and it still works.

          In the patch you may notice I moved the reset() to be before incrementToken() – I find that flow clearest.

          I did have to make a change to the default highlighter. WeightedSpanTermExtractor handed off it's stream to MemoryIndex, and when done, it called reset() as the last thing it did. That's bad behavior, IMO but it turned out to (previously) be necessary because Highlighter called reset() before passing the tokenStream to QueryScorer/WSTE. I fixed this so that WSTE doesn't call reset (it doesn't call incrementToken itself, after all), and moved Highlighter's invocation of reset() to the last possible moment, right before the loop of incrementToken(). I think this is best practice in general – always call reset() as close to incrementToken() as you can.

          In CHANGES.txt I'll say this:

          CachingTokenFilter now propagates reset() to its input if incrementToken() hasn't been called yet. You should generally call reset() now on this token filter instead of doing it a-priori on its input (which previously didn't work).

          Show
          David Smiley added a comment - The attached patch propagates reset() on reset() only when the stream hasn't been cached yet. This is good/intuitive behavior, simple, and actually isn't as much of a change to existing users; many might not even notice. If you still call reset on the input (but not in CachingTokenFilter) before then calling incrementToken, it'll all work out fine. You shouldn't to follow the spirit of our API, but you can. In fact there's one query builder ('flexible' AnalyzerQueryNodeProcessor) that I didn't change to move what it calls reset() on and it still works. In the patch you may notice I moved the reset() to be before incrementToken() – I find that flow clearest. I did have to make a change to the default highlighter. WeightedSpanTermExtractor handed off it's stream to MemoryIndex, and when done, it called reset() as the last thing it did. That's bad behavior, IMO but it turned out to (previously) be necessary because Highlighter called reset() before passing the tokenStream to QueryScorer/WSTE. I fixed this so that WSTE doesn't call reset (it doesn't call incrementToken itself, after all), and moved Highlighter's invocation of reset() to the last possible moment, right before the loop of incrementToken(). I think this is best practice in general – always call reset() as close to incrementToken() as you can. In CHANGES.txt I'll say this: CachingTokenFilter now propagates reset() to its input if incrementToken() hasn't been called yet. You should generally call reset() now on this token filter instead of doing it a-priori on its input (which previously didn't work).
          Hide
          Robert Muir added a comment -

          If you still call reset on the input (but not in CachingTokenFilter) before then calling incrementToken, it'll all work out fine.

          How is that? MockTokenizer should be throwing an exception here. its not ok to reset twice. Please change the text to "you must".

          In fact there's one query builder ('flexible' AnalyzerQueryNodeProcessor) that I didn't change to move what it calls reset() on and it still works.

          Then it should be fixed, ideally to use MockAnalyzer so the double-reset causes a test failure first.

          We dont need an API that has "flexibility", consumers must follow the normal contract and it will work:
          reset() .. incrementToken().... end(), close()

          I am also ok if we only forward close() a single time for now. I do not trust that all TokenStreams obey Closeable yet and will ignore the second invocation. But technically, CachingTokenFilter shouldn't need to do anything special there.

          Show
          Robert Muir added a comment - If you still call reset on the input (but not in CachingTokenFilter) before then calling incrementToken, it'll all work out fine. How is that? MockTokenizer should be throwing an exception here. its not ok to reset twice. Please change the text to "you must". In fact there's one query builder ('flexible' AnalyzerQueryNodeProcessor) that I didn't change to move what it calls reset() on and it still works. Then it should be fixed, ideally to use MockAnalyzer so the double-reset causes a test failure first. We dont need an API that has "flexibility", consumers must follow the normal contract and it will work: reset() .. incrementToken().... end(), close() I am also ok if we only forward close() a single time for now. I do not trust that all TokenStreams obey Closeable yet and will ignore the second invocation. But technically, CachingTokenFilter shouldn't need to do anything special there.
          Hide
          David Smiley added a comment -

          How is that? MockTokenizer should be throwing an exception here. its not ok to reset twice.

          To be clear, this worked and still works (though we might not want to advertise that it does): this is what I meant by it'll all work out fine

          input.reset();
          buffer = new CachingTokenFilter(input);
          buffer.incrementToken() ...
          

          This used to work (even though it was bad) and doesn't work any longer – MockTokenizer will see the second reset() and throw:

          input.reset();
          buffer = new CachingTokenFilter(input);
          buffer.reset(); // CachingTokenizer used to swallow this but now won't.  This is what was indirectly happening in my story about Highlighter/WSTE/MemoryIndex but I fixed Highlighter & WSTE.
          buffer.incrementToken() ...
          

          MockTokenizer never 'sees' any reset past the first because if you c

          Please change the text to "you must".

          Ok.

          > In fact there's one query builder ('flexible' AnalyzerQueryNodeProcessor) that I didn't change to move what it calls reset() on and it still works.

          Then it should be fixed, ideally to use MockAnalyzer so the double-reset causes a test failure first.

          Of course, I left it temporarily to show this isn't going to be the backwards-break we thought it would be. I fixed it in this patch. I also added a double-reset test to CachingTokenFilter.

          RE close(), I could add a boolean closed flag so we only propagate once and only if cache != null. I'd rather not bother. It's seems you're fine either way so I'll let it be.

          Do you think this patch is ready to be committed? I'll do a "precommit" before literally doing it.

          Show
          David Smiley added a comment - How is that? MockTokenizer should be throwing an exception here. its not ok to reset twice. To be clear, this worked and still works (though we might not want to advertise that it does): this is what I meant by it'll all work out fine input.reset(); buffer = new CachingTokenFilter(input); buffer.incrementToken() ... This used to work (even though it was bad) and doesn't work any longer – MockTokenizer will see the second reset() and throw: input.reset(); buffer = new CachingTokenFilter(input); buffer.reset(); // CachingTokenizer used to swallow this but now won't. This is what was indirectly happening in my story about Highlighter/WSTE/MemoryIndex but I fixed Highlighter & WSTE. buffer.incrementToken() ... MockTokenizer never 'sees' any reset past the first because if you c Please change the text to "you must". Ok. > In fact there's one query builder ('flexible' AnalyzerQueryNodeProcessor) that I didn't change to move what it calls reset() on and it still works. Then it should be fixed, ideally to use MockAnalyzer so the double-reset causes a test failure first. Of course, I left it temporarily to show this isn't going to be the backwards-break we thought it would be. I fixed it in this patch. I also added a double-reset test to CachingTokenFilter. RE close(), I could add a boolean closed flag so we only propagate once and only if cache != null. I'd rather not bother. It's seems you're fine either way so I'll let it be. Do you think this patch is ready to be committed? I'll do a "precommit" before literally doing it.
          Hide
          Robert Muir added a comment -

          Looks great! Thank you.

          Can we not mask the exception in QueryBuilder? it shouldn't be possible, so we can just wrap with RuntimeException.

          Show
          Robert Muir added a comment - Looks great! Thank you. Can we not mask the exception in QueryBuilder? it shouldn't be possible, so we can just wrap with RuntimeException.
          Hide
          ASF subversion and git services added a comment -

          Commit 1646737 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1646737 ]

          LUCENE-6121: CachingTokenFilter.reset() propagates to input if not cached

          Show
          ASF subversion and git services added a comment - Commit 1646737 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1646737 ] LUCENE-6121 : CachingTokenFilter.reset() propagates to input if not cached
          Hide
          ASF subversion and git services added a comment -

          Commit 1646767 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1646767 ]

          LUCENE-6121: CachingTokenFilter.reset() propagates to input if not cached

          Show
          ASF subversion and git services added a comment - Commit 1646767 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1646767 ] LUCENE-6121 : CachingTokenFilter.reset() propagates to input if not cached
          Hide
          David Smiley added a comment -

          Thanks for your input Rob.

          FYI I moved the QueryBuilder buffer.reset() to the end of the previous try-catch that wraps with RuntimeException, so it'll not be swallowed.

          Show
          David Smiley added a comment - Thanks for your input Rob. FYI I moved the QueryBuilder buffer.reset() to the end of the previous try-catch that wraps with RuntimeException, so it'll not be swallowed.
          Hide
          Robert Muir added a comment -

          I don't like that: you can't follow the logic flow: because reset is in a crazy place.

          Please, just wrap with runtimeexception.

          Show
          Robert Muir added a comment - I don't like that: you can't follow the logic flow: because reset is in a crazy place. Please, just wrap with runtimeexception.
          Hide
          David Smiley added a comment -

          Sure; that point occurred to me too.

          If there is no term attribute (as strange as that may be), then this reset() call needs to be guarded by if (numTokens > 0) or else a double-reset will ensue. See it?

          Also, I'd like to remove the needless = null|false initializations to variables where it's not needed due to the compiler figuring out it's not needed (which IntelliJ helpfully points out).

          Show
          David Smiley added a comment - Sure; that point occurred to me too. If there is no term attribute (as strange as that may be), then this reset() call needs to be guarded by if (numTokens > 0) or else a double-reset will ensue. See it? Also, I'd like to remove the needless = null|false initializations to variables where it's not needed due to the compiler figuring out it's not needed (which IntelliJ helpfully points out).
          Hide
          David Smiley added a comment -

          ... cleaner to just move the numTokens==0 to before the reset, I think

          Show
          David Smiley added a comment - ... cleaner to just move the numTokens==0 to before the reset, I think
          Hide
          Robert Muir added a comment -

          Can you just wrap with runtimeexception and try making minimal changes to the code as it was for this issue?

          This method is e.g. used by all queryparsers and is complex. This is not the place to try to be cute to save a try/catch. It is important that logic and flow make sense, because its already confusing.

          I am sure there are other issues with the method besides the no-term-att-case, because its complicated. But that means its even more important to be cautious, e.g. approach this in a separate issue and so on.

          Show
          Robert Muir added a comment - Can you just wrap with runtimeexception and try making minimal changes to the code as it was for this issue? This method is e.g. used by all queryparsers and is complex. This is not the place to try to be cute to save a try/catch. It is important that logic and flow make sense, because its already confusing. I am sure there are other issues with the method besides the no-term-att-case, because its complicated. But that means its even more important to be cautious, e.g. approach this in a separate issue and so on.
          Hide
          Robert Muir added a comment -

          ... cleaner to just move the numTokens==0 to before the reset, I think

          Please, I do not want these changes under this issue. The current code was already committed in a cowboy way.

          Trunk tests are failing. branch_5x compile does not fail.

          More care is needed here.

          Show
          Robert Muir added a comment - ... cleaner to just move the numTokens==0 to before the reset, I think Please, I do not want these changes under this issue. The current code was already committed in a cowboy way. Trunk tests are failing. branch_5x compile does not fail. More care is needed here.
          Hide
          Michael McCandless added a comment -

          Can we rollback this change for now? Sounds like it needs a little more iterating here first...

          Show
          Michael McCandless added a comment - Can we rollback this change for now? Sounds like it needs a little more iterating here first...
          Hide
          ASF subversion and git services added a comment -

          Commit 1646793 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1646793 ]

          LUCENE-6121: Fix a test to assumeTrue assertions enabled.

          Show
          ASF subversion and git services added a comment - Commit 1646793 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1646793 ] LUCENE-6121 : Fix a test to assumeTrue assertions enabled.
          Hide
          ASF subversion and git services added a comment -

          Commit 1646794 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1646794 ]

          LUCENE-6121: QueryBuilder put 2nd reset() to where it was, and add test to ensure no double-reset if there is no term attribute.

          Show
          ASF subversion and git services added a comment - Commit 1646794 from David Smiley in branch 'dev/trunk' [ https://svn.apache.org/r1646794 ] LUCENE-6121 : QueryBuilder put 2nd reset() to where it was, and add test to ensure no double-reset if there is no term attribute.
          Hide
          ASF subversion and git services added a comment -

          Commit 1646796 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1646796 ]

          LUCENE-6121: QueryBuilder put 2nd reset() to where it was, and add test to ensure no double-reset if there is no term attribute.

          Show
          ASF subversion and git services added a comment - Commit 1646796 from David Smiley in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1646796 ] LUCENE-6121 : QueryBuilder put 2nd reset() to where it was, and add test to ensure no double-reset if there is no term attribute.
          Hide
          David Smiley added a comment -

          I appreciate it may have seemed "cowboy" from where you sit but I ran tests on trunk & 5x, precommit as well. Of course this doesn't guarantee no problems.
          1. I should switch my 5x environment shell to switch my java environment to Java 7, not my default (Java 8). Thanks for the quick fix Mike.
          2. The test failure seen on trunk was because the test I added fundamentally requires MockTokenizer to do its thing, and when assertions aren't enabled (randomly chosen) then it won't. So I committed this moments ago:

          assumeTrue("We want MockAnalyzer to detect double-reset", TEST_ASSERTS_ENABLED);
          

          I anticipate both trunk/5x should be stable now.

          I am sure there are other issues with the method besides the no-term-att-case, because its complicated. But that means its even more important to be cautious, e.g. approach this in a separate issue and so on.

          I added a test to TestQueryBuilder that tests the behavior we want when there is no term attribute, since clearly QueryBuilder was written to support that yet it wasn't tested.

          Please, I do not want these changes under this issue.

          I don't agree on insisting on a heavy-weight process for minor code clarity improvements. It's not worth it to create an issue over such things; so it sits there the way it is. As it is, we need to commit to multiple branches, which is a PITA. But fine, you have a right to your opinion.

          Show
          David Smiley added a comment - I appreciate it may have seemed "cowboy" from where you sit but I ran tests on trunk & 5x, precommit as well. Of course this doesn't guarantee no problems. 1. I should switch my 5x environment shell to switch my java environment to Java 7, not my default (Java 8). Thanks for the quick fix Mike. 2. The test failure seen on trunk was because the test I added fundamentally requires MockTokenizer to do its thing, and when assertions aren't enabled (randomly chosen) then it won't. So I committed this moments ago: assumeTrue( "We want MockAnalyzer to detect double -reset" , TEST_ASSERTS_ENABLED); I anticipate both trunk/5x should be stable now. I am sure there are other issues with the method besides the no-term-att-case, because its complicated. But that means its even more important to be cautious, e.g. approach this in a separate issue and so on. I added a test to TestQueryBuilder that tests the behavior we want when there is no term attribute, since clearly QueryBuilder was written to support that yet it wasn't tested. Please, I do not want these changes under this issue. I don't agree on insisting on a heavy-weight process for minor code clarity improvements. It's not worth it to create an issue over such things; so it sits there the way it is. As it is, we need to commit to multiple branches, which is a PITA. But fine, you have a right to your opinion.
          Hide
          Michael McCandless added a comment -

          I should switch my 5x environment shell to switch my java environment to Java 7, not my default (Java 8). Thanks for the quick fix Mike.

          Actually, even if you use Java 8 on 5.x, it fails compilation (thanks to javac.source=1.7).

          Still it is of course best to run compilation/tests with java7 on backport ... but it's easy to forget ...

          Show
          Michael McCandless added a comment - I should switch my 5x environment shell to switch my java environment to Java 7, not my default (Java 8). Thanks for the quick fix Mike. Actually, even if you use Java 8 on 5.x, it fails compilation (thanks to javac.source=1.7). Still it is of course best to run compilation/tests with java7 on backport ... but it's easy to forget ...

            People

            • Assignee:
              David Smiley
              Reporter:
              David Smiley
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:

                Development