Lucene - Core

LUCENE-6033: Add CachingTokenFilter.isCached and switch LinkedList to ArrayList

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      CachingTokenFilter could use a simple boolean isCached() method, implemented as such:

        /** If the underlying token stream was consumed and cached */
        public boolean isCached() {
          return cache != null;
        }
      

      It's useful for the highlighting code, which can remove its CachingTokenFilter wrapper if, after handing the stream off to parts of its framework, it turns out that the cache wasn't actually used.

      Furthermore, use an ArrayList, not a LinkedList. ArrayList is leaner when the token count is high, and this class doesn't manipulate the list in a way that would favor a LinkedList; it only appends and then iterates.
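
      For illustration, a minimal sketch of how the cache field and fill loop might look with this change (the names follow the existing class, but treat the exact bodies as an approximation rather than the committed patch):

        // append-only cache of captured token states; ArrayList avoids
        // LinkedList's per-node overhead when the token count is high
        private List<AttributeSource.State> cache = null;

        private void fillCache() throws IOException {
          cache = new ArrayList<>();
          while (input.incrementToken()) {
            cache.add(captureState());
          }
          // capture the final state so end() can also be replayed from the cache
          input.end();
          finalState = captureState();
        }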

      A separate patch will come that actually uses this method.

      Attachments

      1. LUCENE-6033_boolean_resetInput_option.patch
         5 kB
         David Smiley
      2. LUCENE-6033.patch
         1 kB
         David Smiley


          Activity

          David Smiley added a comment -

          I'll commit this Wednesday morning, subject to feedback.

          ASF subversion and git services added a comment -

          Commit 1638794 from David Smiley in branch 'dev/trunk'
          [ https://svn.apache.org/r1638794 ]

          LUCENE-6033: CachingTokenFilter now uses ArrayList not LinkedList, and has new isCached() method

          ASF subversion and git services added a comment -

          Commit 1638796 from David Smiley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1638796 ]

          LUCENE-6033: CachingTokenFilter now uses ArrayList not LinkedList, and has new isCached() method

          David Smiley added a comment -

          This patch adds a "resetStream" constructor option such that fillCache() will propagate reset() if this is set. Very simple. I enhanced the test for this and for isCached(). This option goes hand-in-hand with the use of isCached() for the use-case I had in mind by allowing you to pass a tokenStream to something that might not need it, thereby allowing you to not only toss the CachingTokenFilter if it wasn't actually cached, but avoid a redundant reset() call on the underlying input.

          Robert Muir added a comment -

          I don't understand this option; why do we need it? How is it useful to the consumers that use CachingTokenFilter (like the query parser)? It seems more of an abusive case.

          David Smiley added a comment -

          Hi Rob.
          I think it's easier to make the case that CachingTokenFilter should have been propagating reset() from its fillCache() all along, and thus you would then use CachingTokenFilter in a more normal way: wrap it, call reset(), then increment in a loop, etc., instead of knowing you need to call reset() on what it wraps but not on this token filter itself. That's weird. It's abnormal for a TokenFilter to never propagate reset(), so every user of CachingTokenFilter to date has worked around this by calling reset() on the underlying input instead of on the final wrapping token filter (CachingTokenFilter in this case). To be clear, CachingTokenFilter.reset() didn't, and with this patch still doesn't, propagate reset(); that happens the one time it consumes the stream indirectly via incrementToken().

          The exact case that brought me here is as follows: DefaultSolrHighlighter has a block of code activated when you pass "hl.usePhraseHighlighter" (around line 501). This block of code calls getPhraseHighlighter, passing in a token stream that may never actually be used by that method. This is an extension point for subclassing; our shipped code doesn't use it at all. Prior to my doing SOLR-6680, we'd always then pass the CachingTokenFilter further on into the Highlighter. But unless getPhraseHighlighter actually uses the token stream, doing this is a waste (needlessly caching every token, which is pretty bulky). So with isCached() I can now see whether it was used, and if not, toss the CachingTokenFilter aside. The problem is that isCached() isn't enough here; I overlooked it in SOLR-6680 (no test for this extension point). I was hoping to simply declare that if you want to use this token stream, you need to call reset() on it first. But CachingTokenFilter doesn't propagate the reset()! So it won't get reset. I could add a reset() on the underlying stream before calling getPhraseHighlighter, but doing so would likely result in reset() being called twice in a row when the caching isn't needed, since Highlighter calls reset(). Test assertions trip when that happens, although I think in practice it's fine.
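
          For example, the unwrap step could look roughly like this (the getPhraseHighlighter call and the surrounding variables are illustrative, not the exact DefaultSolrHighlighter code):

            CachingTokenFilter cachingStream = new CachingTokenFilter(tokenStream);
            Highlighter highlighter = getPhraseHighlighter(query, fieldName, req, cachingStream);
            // if the extension point never consumed the stream, drop the caching wrapper
            // and hand the original (uncached) stream to the Highlighter instead
            TokenStream streamForHighlighting = cachingStream.isCached() ? cachingStream : tokenStream;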

          Robert Muir added a comment -

          I think it's easier to make the case that CachingTokenFilter should have been propagating reset() from its fillCache() all along, and thus you would then use CachingTokenFilter in a more normal way: wrap it, call reset(), then increment in a loop, etc., instead of knowing you need to call reset() on what it wraps but not on this token filter itself. That's weird. It's abnormal for a TokenFilter to never propagate reset(), so every user of CachingTokenFilter to date has worked around this by calling reset() on the underlying input instead of on the final wrapping token filter (CachingTokenFilter in this case). To be clear, CachingTokenFilter.reset() didn't, and with this patch still doesn't, propagate reset(); that happens the one time it consumes the stream indirectly via incrementToken().

          I mostly agree with this. Except I think this filter should propagate reset() from its own reset(), only once (then, once cached, it obviously doesn't propagate any such call).

          But then why propose an option? I would rather fix the behavior of this thing and not have "two ways to do it". Consumers of the TokenStream API already have a hard enough time.

          We can make a break, since it's 5.0, and there aren't many consumers of CachingTokenFilter. But one of them is a big one, the query parser, which is not good. Another option is to deprecate it and just make a new one beside it with a different name and the new "good" reset behavior.
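
          A sketch of that alternative (hypothetical code, not the patch attached to this issue): reset() forwards to the wrapped stream the first time, and afterwards only rewinds the replay:

            @Override
            public void reset() throws IOException {
              if (cache == null) {
                input.reset();                // not yet cached: forward to the wrapped stream
              } else {
                iterator = cache.iterator();  // already cached: just rewind the replay
              }
            }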

          Robert Muir added a comment -

          Also, can we just make a new issue for this? I think it's a good idea, but it's unrelated to the current issue, which is already committed.

          David Smiley added a comment -

          I mostly agree with this. Except I think this filter should propagate reset() from its own reset(), only once (then, once cached, it obviously doesn't propagate any such call).

          +1, totally fine with me. That was actually my first cut... but then I wondered whether some unknowing downstream code (which doesn't even know it's a CachingTokenFilter) might not call reset() due to a bug on its part, or who knows why. So delaying until the last possible point seemed safer.

          But then why propose an option? I would rather fix the behavior of this thing and not have "two ways to do it". Consumers of the TokenStream API already have a hard enough time.

          Because of backwards compatibility. But sure... let's fix this now that I know you're game.

          I just created LUCENE-6121. I first added my patch here for a bunch of reasons, but if we're going to disturb the status quo then it definitely deserves its own issue.

          Robert Muir added a comment -

          +1, totally fine with me. That was actually my first cut... but then I wondered whether some unknowing downstream code (which doesn't even know it's a CachingTokenFilter) might not call reset() due to a bug on its part, or who knows why. So delaying until the last possible point seemed safer.

          If we want to make this thing behave "transparently" then consumers must consume correctly. We should not try to correct users' mistakes here, but instead just focus on doing the right thing.
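
          For reference, a sketch of the consumer contract being referred to (the standard reset/incrementToken/end/close workflow; the analyzer variable here is assumed to be in scope):

            try (TokenStream stream = analyzer.tokenStream("field", "some text")) {
              stream.reset();                  // must be called before incrementToken()
              while (stream.incrementToken()) {
                // read attributes, e.g. CharTermAttribute, here
              }
              stream.end();                    // records the final offset state
            }                                  // close() releases resources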

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.


            People

            • Assignee: David Smiley
            • Reporter: David Smiley
            • Votes: 0
            • Watchers: 3
