
SOLR-3961: LimitTokenCountFilterFactory config parsing is totally broken

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: None
    • Labels: None

      Description

      As noted on the mailing list, LimitTokenCountFilterFactory throws a NumberFormatException because it tries to use the value of its config param as a key to look up another param that it then parses as an integer ... totally ridiculous.

      Attachments

      1. SOLR-3961.patch (5 kB) by Hoss Man
      2. SOLR-3961.patch (5 kB) by Hoss Man

        Activity

        Hoss Man added a comment -

        From the list...

        I read the following in the example solrconfig:
        
         <!-- maxFieldLength was removed in 4.0. To get similar behavior, include a
                 LimitTokenCountFilterFactory in your fieldType definition. E.g.
             <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000"/>
            -->
        
        I tried that as follows:
        
        ...
        <fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
              <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.LimitTokenCountFilterFactory"
        maxTokenCount="100000"/>
                <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.SnowballPorterFilterFactory" language="German"
        />
                <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
              </analyzer>
        ...
        
        Configured like that, LimitTokenCountFilterFactory crashes the startup of the corresponding core with the following exception (without the factory, the core starts up fine):
        
        
        17.10.2012 17:44:19 org.apache.solr.common.SolrException log
        SCHWERWIEGEND: null:org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "textgen": Plugin init failure for [schema.xml] analyzer/filter: null
                at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
                at ...
        

        And as Jack noted...

        Anybody want to guess what's wrong with this code:
        
        String maxTokenCountArg = args.get("maxTokenCount");
        if (maxTokenCountArg == null) {
          throw new IllegalArgumentException("maxTokenCount is mandatory.");
        }
        maxTokenCount = Integer.parseInt(args.get(maxTokenCountArg));
        
        Hmmm... try this "workaround":
        
            <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="foo" foo="10000"/>
        
        Hoss Man added a comment -

        I modified one of the test schemas to include LimitTokenCountFilterFactory to quickly demonstrate the init failure, then banged out a quick fix to the factory. For good measure, I added a unit test of the factory itself, and seem to have uncovered another bug somewhere, although I'm not yet clear if it's in the Filter (violating the TokenStream contract) or in BaseTokenStreamTestCase (trying to enforce the contract)...

        [junit4:junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestLimitTokenCountFilterFactory -Dtests.method=test -Dtests.seed=D40B43DE6627A1AB -Dtests.slow=true -Dtests.locale=mk -Dtests.timezone=Singapore -Dtests.file.encoding=US-ASCII
        [junit4:junit4] FAILURE 0.13s | TestLimitTokenCountFilterFactory.test <<<
        [junit4:junit4]    > Throwable #1: java.lang.AssertionError: end() called before incrementToken() returned false!
        [junit4:junit4]    > 	at __randomizedtesting.SeedInfo.seed([D40B43DE6627A1AB:5C5F7C04C8DBCC53]:0)
        [junit4:junit4]    > 	at org.apache.lucene.analysis.MockTokenizer.end(MockTokenizer.java:243)
        [junit4:junit4]    > 	at org.apache.lucene.analysis.TokenFilter.end(TokenFilter.java:46)
        [junit4:junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:239)
        [junit4:junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:250)
        [junit4:junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:262)
        [junit4:junit4]    > 	at org.apache.lucene.analysis.miscellaneous.TestLimitTokenCountFilterFactory.test(TestLimitTokenCountFilterFactory.java:37)
        

        I'm going to need another set of eyeballs on this.

        Robert Muir added a comment -

        For testing LimitTokenCountFilter, you need to call MockTokenizer.setEnableChecks(false) (see http://lucene.apache.org/core/4_0_0/test-framework/org/apache/lucene/analysis/MockTokenizer.html#setEnableChecks%28boolean%29).

        That's because LimitTokenCountFilter violates the workflow of the standard TokenStream API (http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html): it cuts off the stream it consumes and calls end() even while that stream still has more tokens. This could easily cause unexpected side effects and bugs.

        But we knew about this anyway: existing tests for LimitTokenCountFilter already do setEnableChecks(false) for this exact reason... this is just an explanation of why.
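
        To illustrate, a minimal sketch of such a test (assuming Lucene 4.x test-framework APIs; the class and method names here are hypothetical):

        import java.io.StringReader;
        import org.apache.lucene.analysis.BaseTokenStreamTestCase;
        import org.apache.lucene.analysis.MockTokenizer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.miscellaneous.LimitTokenCountFilter;

        public class TestLimitTokenCountSketch extends BaseTokenStreamTestCase {
          public void testTruncation() throws Exception {
            MockTokenizer tokenizer = new MockTokenizer(
                new StringReader("one two three four"), MockTokenizer.WHITESPACE, false);
            // LimitTokenCountFilter stops consuming early, so disable the
            // tokenstream-contract checks that would otherwise fail in end()
            tokenizer.setEnableChecks(false);
            TokenStream stream = new LimitTokenCountFilter(tokenizer, 2);
            assertTokenStreamContents(stream, new String[] {"one", "two"});
          }
        }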

        Hoss Man added a comment -

        Thanks rmuir, setEnableChecks is the piece I was missing, and using it makes the test pass, but I'm still a little confused by the rest of your comment and what I'm seeing in the tests...

        But we knew about this anyway: existing tests for LimitTokenCountFilter already do setEnableChecks(false) for this exact reason... this is just an explanation of why.

        1) HunspellStemFilterTest is the only lucene/analysis test I see using setEnableChecks (although there do seem to be some highlighter tests that use it, and TestIndexWriterExceptions uses it to ignore secondary problems, since it goes out of its way to force exceptions).

        2) I don't see any existing tests for LimitTokenCountFilter ... were they deleted by mistake?

        3) The closest thing I see to a test of LimitTokenCountFilter is TestLimitTokenCountAnalyzer. I realize now that the reason its testLimitTokenCountAnalyzer doesn't get the same failure is that it wraps WhitespaceAnalyzer and StandardAnalyzer. Should those be changed to use MockTokenizer?

        4) TestLimitTokenCountAnalyzer also has a testLimitTokenCountIndexWriter that uses MockAnalyzer without calling setEnableChecks(false), which seems like it should trigger the same failure I got, since it uses MockTokenizer. In general, though, that test looks suspicious: it adds exactly the number of tokens the limit is configured for and then asserts that the last token is in the index, so it never actually triggers the limiting logic.

        Robert Muir added a comment -

        HunspellStemFilterTest is the only lucene/analysis test I see using setEnableChecks

        It sets it to true, which is dead code (true is the default!).

        although there do seem to be some highlighter tests that use it

        Highlighter has a built-in limiter that limits not on token count but on the accumulated number of analyzed chars, so its tests disable the checks for the same reason LimitTokenCountFilter's do (or should).

        2) I don't see any existing tests for LimitTokenCountFilter ... were they deleted by mistake?

        I think these are in TestLimitTokenCountAnalyzer? For Lucene users, wrapping your analyzer like that is the way you use this.
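
        For reference, a minimal sketch of that usage (assuming the Lucene 4.0 analysis APIs):

        // org.apache.lucene.analysis.standard.StandardAnalyzer
        // org.apache.lucene.analysis.miscellaneous.LimitTokenCountAnalyzer
        Analyzer base = new StandardAnalyzer(Version.LUCENE_40);
        // behaves like "base" but emits at most 10000 tokens per field
        Analyzer limited = new LimitTokenCountAnalyzer(base, 10000);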

        3) The closest thing I see to a test of LimitTokenCountFilter is TestLimitTokenCountAnalyzer. I realize now that the reason its testLimitTokenCountAnalyzer doesn't get the same failure is that it wraps WhitespaceAnalyzer and StandardAnalyzer. Should those be changed to use MockTokenizer?

        I think we should always do this!

        4) TestLimitTokenCountAnalyzer also has a testLimitTokenCountIndexWriter that uses MockAnalyzer without calling setEnableChecks(false), which seems like it should trigger the same failure I got, since it uses MockTokenizer. In general, though, that test looks suspicious: it adds exactly the number of tokens the limit is configured for and then asserts that the last token is in the index, so it never actually triggers the limiting logic.

        Then that's fine: when LimitTokenCountFilter consumes the whole stream, it's a "good consumer". It's only when it actually truncates that it breaks the tokenstream contract.
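
        To make that concrete, a simplified sketch of the truncating loop (not the actual Lucene source; the field names are illustrative):

        @Override
        public boolean incrementToken() throws IOException {
          if (tokenCount < maxTokenCount && input.incrementToken()) {
            tokenCount++;
            return true;
          }
          // may report exhaustion while the wrapped stream still has tokens,
          // so end() then runs on a source that never returned false itself
          return false;
        }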

        Hoss Man added a comment -

        I spun off LUCENE-4489 for some of the remaining questions about testing LimitTokenCountFilter, since it's really only tangentially related to this issue. In the meantime I want to commit a fix for this factory bug.

        Committed revision 1399474. - trunk
        Committed revision 1399482. - branch_4x

        ...I'm a little unclear as to how we're dealing with possible 4.0.1 fixes, so I'm leaving this issue open for now until I figure it out.

        (BTW: Thanks for figuring out this problem Jack!)

        Steve Rowe added a comment -

        Hoss, I don't think there will be a 4.0.1 release - can this be resolved as fixed?

        Commit Tag Bot added a comment -

        [branch_4x commit] Chris M. Hostetter
        http://svn.apache.org/viewvc?view=revision&revision=1399482

        SOLR-3961: Fixed error using LimitTokenCountFilterFactory (merge r1399474)


          People

          • Assignee: Hoss Man
          • Reporter: Hoss Man
          • Votes: 0
          • Watchers: 0
