Lucene - Core

Improve LimitTokenCountFilter and/or its tests

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinning off a discussion about LimitTokenCountFilter and its tests from SOLR-3961 (which was about a specific bug in the LimitTokenCountFilterFactory).

      Attachments

      1. LUCENE-4489.patch
        12 kB
        Hoss Man
      2. LUCENE-4489.patch
        9 kB
        Robert Muir
      3. LUCENE-4489.patch
        9 kB
        Robert Muir
      4. LUCENE-4489.patch
        8 kB
        Hoss Man


          Activity

          Hoss Man added a comment -

          In SOLR-3961 rmuir and I were discussing the tests for LimitTokenCountFilter. It seems to me like there are at least 2 things that should be improved in the tests, and one possible improvement to the code itself...

          A) fix TestLimitTokenCountAnalyzer to use MockTokenizer...

          3) The closest thing I see to a test of LimitTokenCountFilter is TestLimitTokenCountAnalyzer - I realize now the reason its testLimitTokenCountAnalyzer doesn't get the same failure is because it's wrapping WhitespaceAnalyzer, StandardAnalyzer - should those be changed to use MockTokenizer?

          I think we should always do this!

          B) TestLimitTokenCountAnalyzer.testLimitTokenCountIndexWriter smells fishy to me; still discussing with rmuir...

          4) TestLimitTokenCountAnalyzer also has a testLimitTokenCountIndexWriter that uses MockAnalyzer w/o calling setEnableChecks(false), which seems like it should trigger the same failure I got since it uses MockTokenizer. But in general that test looks suspicious, as it seems to add the exact number of tokens that the limit is configured for, and then asserts that the last token is in the index - but it never actually triggers the limiting logic since exactly the allowed number of tokens are used.

          Then that's fine, because when LimitTokenCountFilter consumes the whole stream, it's a "good consumer". It's only when it actually truncates that it breaks the TokenStream contract.

          My concern wasn't that I thought that test was fishy just because it didn't call setEnableChecks(false) – my concern was that I don't see the point of the test at all as it's currently written, because it doesn't actually trigger the token limiting behavior (if it did, it would have to call setEnableChecks(false)) ... so what exactly is it supposed to be testing?

          C) going back to a comment rmuir made earlier in SOLR-3961...

          That's because LimitTokenCountFilter violates the workflow of the standard TokenStream API ... it cuts off its consumer and calls end() even while that consumer still has more tokens: this could easily cause unexpected side effects and bugs. ... But we knew about this anyway

          This makes me wonder if it would be worthwhile to add an option to LimitTokenCountFilter to make it always consume all the tokens from the stream it wraps – it could be off by default so it would behave exactly like today, but if you are wrapping a TokenStream where it's really important to you that every token is consumed before end() is called, you could set it appropriately.

          Would that make sense?
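
          For illustration only, here is a minimal sketch of what such an option might look like (this is not the actual patch; the class name is a placeholder, and the consumeAllTokens flag simply drains the wrapped stream before end() is reached):

            import java.io.IOException;

            import org.apache.lucene.analysis.TokenFilter;
            import org.apache.lucene.analysis.TokenStream;

            /** Sketch: emit at most maxTokenCount tokens, optionally draining the rest of the input. */
            public final class LimitTokenCountFilterSketch extends TokenFilter {
              private final int maxTokenCount;
              private final boolean consumeAllTokens;
              private int tokenCount = 0;
              private boolean exhausted = false;

              public LimitTokenCountFilterSketch(TokenStream in, int maxTokenCount, boolean consumeAllTokens) {
                super(in);
                this.maxTokenCount = maxTokenCount;
                this.consumeAllTokens = consumeAllTokens;
              }

              @Override
              public boolean incrementToken() throws IOException {
                if (exhausted) {
                  return false; // never call input.incrementToken() again once it has returned false
                }
                if (tokenCount < maxTokenCount) {
                  if (input.incrementToken()) {
                    tokenCount++;
                    return true;
                  }
                  exhausted = true; // the input ended before the limit was reached
                  return false;
                }
                // limit reached: optionally pull (and discard) the remaining tokens, so end()
                // is only called once the wrapped stream has really been consumed
                while (consumeAllTokens && input.incrementToken()) {
                  // discard
                }
                exhausted = true;
                return false;
              }

              @Override
              public void reset() throws IOException {
                super.reset();
                tokenCount = 0;
                exhausted = false;
              }
            }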

          Hoss Man added a comment -

          Here's a quick patch I tried to throw together to demonstrate what I had in mind.

          Both tests in TestLimitTokenCountAnalyzer now fail for reasons that aren't really clear to me – not sure if it's because: I made a stupid mistake somewhere; I broke something I don't understand; MockAnalyzer behaves differently enough from WhitespaceAnalyzer that the previous assertions need to be changed; there's some other bug under the covers I've exposed; or some combination of the above.

          Robert Muir added a comment -

          Some combination of the above; attached is an updated patch:

          • The previous patch had a test bug where it reused the same directory in the loop, so the docFreq() would be wrong as it kept adding documents.
          • There was another test bug where it looped from 0..limit and then added "x" but checked docFreq=1; this won't happen as the limit was exceeded (I changed the loop to 1..limit).
          • Previous assertions about finalOffset from end() had wrong values that depended upon implementation details: that's the whole bug here! So these assertions are correct now: if you pass consumeAll=true, the finalOffset is correct, highlighting on multivalued fields with limiting will work correctly, and so on. If you pass consumeAll=false, it's implementation-dependent (and likely will be wrong, just as before). P.S. It would be better to still improve the test to not assert endOffset at all if consumeAll=false, because my "fix" still keeps the test fragile and dependent on MockTokenizer impl in the "wrong" case.
          • The previous patch also had a real bug: if you passed consumeAllTokens=true, but the stream had fewer tokens than the limit, it would incorrectly call incrementToken() after it returned false.
          • Fixes for javadoc errors (see references needed #) and typos.
          Robert Muir added a comment -

          Same patch as above, just passing null in the test for finalOffset when consumeAll is false.

          This way the test isn't fragile; we know that if you specify false for consumeAll, it's likely wrong and will screw up highlighting for multivalued fields, etc. (since it doesn't consume the whole stream, it doesn't necessarily know the real "end").
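
          To make that contract concrete, here is a hedged sketch (not code from the patch) of the standard consumer workflow; the endOffset read after end() is the finalOffset the tests assert on, and it can only be trusted when the wrapped stream was fully consumed:

            import java.io.IOException;
            import java.io.StringReader;

            import org.apache.lucene.analysis.Analyzer;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

            public class ConsumerSketch {
              /** Consumes a field's tokens and returns the final offset reported by end(). */
              static int consume(Analyzer analyzer, String field, String text) throws IOException {
                TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
                try {
                  OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
                  ts.reset();
                  while (ts.incrementToken()) {
                    // use the token...
                  }
                  ts.end(); // the stream sets its final state here, including the final offset
                  // IndexWriter uses this value to compute offsets for later values of a
                  // multivalued field; with consumeAll=false it may be reported too early
                  return offsetAtt.endOffset();
                } finally {
                  ts.close();
                }
              }
            }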

          Hoss Man added a comment -

          Thanks Rob,

          (I changed the loop to 1..limit).

          ...I could not for the life of me make sense of this and how the existing test worked at all, until I realized it wasn't really testing anything about the filter – it was using 10^3 in one place and 10^4 in another – so the limit never even got hit.

          I've updated the patch to add this as an option in the factory, and include some docs explaining the trade-off in using the new option and what the default is.

          (I would appreciate a sanity check of the wording in the javadocs.)
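
          As a usage illustration only (a sketch assuming the new constructor takes (TokenStream, int, boolean), not code from the patch), an analysis chain that enables the option on Lucene 4.1 might look like:

            import java.io.Reader;

            import org.apache.lucene.analysis.Analyzer;
            import org.apache.lucene.analysis.TokenStream;
            import org.apache.lucene.analysis.Tokenizer;
            import org.apache.lucene.analysis.core.WhitespaceTokenizer;
            import org.apache.lucene.analysis.miscellaneous.LimitTokenCountFilter;
            import org.apache.lucene.util.Version;

            public class LimitedAnalyzerSketch {
              public static Analyzer create() {
                return new Analyzer() {
                  @Override
                  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                    Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_41, reader);
                    // keep at most 10000 tokens, but consume the rest of the stream so that
                    // end()/finalOffset (and offsets of later field values) stay correct
                    TokenStream sink = new LimitTokenCountFilter(source, 10000, true);
                    return new TokenStreamComponents(source, sink);
                  }
                };
              }
            }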

          Robert Muir added a comment -

          Wording looks good! Thanks Hoss!

          Hoss Man added a comment -

          Committed revision 1411996. - trunk
          Committed revision 1411997. - 4x

          Commit Tag Bot added a comment -

          [branch_4x commit] Chris M. Hostetter
          http://svn.apache.org/viewvc?view=revision&revision=1411997

          LUCENE-4489: Added consumeAllTokens option to LimitTokenCountFilter (merge r1411996)

          Commit Tag Bot added a comment -

          [trunk commit] Chris M. Hostetter
          http://svn.apache.org/viewvc?view=revision&revision=1411996

          LUCENE-4489: Added consumeAllTokens option to LimitTokenCountFilter


            People

            • Assignee:
              Hoss Man
            • Reporter:
              Hoss Man
            • Votes:
              0
            • Watchers:
              1
