Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7442

MinHashFilter's ctor should validate its args

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.2.1
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      My Jenkins found this reproducing branch_6x seed:

         [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
         [junit4]   2> Exception from random analyzer: 
         [junit4]   2> charfilters=
         [junit4]   2> tokenizer=
         [junit4]   2>   org.apache.lucene.analysis.standard.StandardTokenizer()
         [junit4]   2> filters=
         [junit4]   2>   org.apache.lucene.analysis.minhash.MinHashFilter(ValidatingTokenFilter@6ae99167 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word, 5, 5, -3, true)
         [junit4]   2>   org.apache.lucene.analysis.bg.BulgarianStemFilter(ValidatingTokenFilter@40844352 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,keyword=false)
         [junit4]   2> offsetsAreCorrect=true
         [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=4733E677EBDC28FC -Dtests.slow=true -Dtests.locale=ar-OM -Dtests.timezone=Atlantic/South_Georgia -Dtests.asserts=true -Dtests.file.encoding=UTF-8
         [junit4] ERROR   3.18s J4 | TestRandomChains.testRandomChainsWithLargeStrings <<<
         [junit4]    > Throwable #1: java.util.NoSuchElementException
         [junit4]    > 	at __randomizedtesting.SeedInfo.seed([4733E677EBDC28FC:2D685966B292080F]:0)
         [junit4]    > 	at java.util.TreeMap.key(TreeMap.java:1323)
         [junit4]    > 	at java.util.TreeMap.lastKey(TreeMap.java:297)
         [junit4]    > 	at java.util.TreeSet.last(TreeSet.java:401)
         [junit4]    > 	at org.apache.lucene.analysis.minhash.MinHashFilter$FixedSizeTreeSet.add(MinHashFilter.java:325)
         [junit4]    > 	at org.apache.lucene.analysis.minhash.MinHashFilter.incrementToken(MinHashFilter.java:159)
         [junit4]    > 	at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
         [junit4]    > 	at org.apache.lucene.analysis.bg.BulgarianStemFilter.incrementToken(BulgarianStemFilter.java:48)
         [junit4]    > 	at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
         [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkResetException(BaseTokenStreamTestCase.java:405)
         [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:510)
         [junit4]    > 	at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:959)
         [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
         [junit4]   2> NOTE: test params are: codec=Asserting(Lucene62): {dummy=Lucene50(blocksize=128)}, docValues:{}, maxPointsInLeafNode=252, maxMBSortInHeap=5.297834377897023, sim=ClassicSimilarity, locale=ar-OM, timezone=Atlantic/South_Georgia
         [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=395080152,total=465567744
         [junit4]   2> NOTE: All tests run in this JVM: [TestDecimalDigitFilterFactory, TestMultiWordSynonyms, TestReversePathHierarchyTokenizer, TestDoubleEscape, TestHunspellStemFilterFactory, TestArabicNormalizationFilter, TestUAX29URLEmailAnalyzer, TestSwedishLightStemFilterFactory, TestBulgarianStemmer, TestASCIIFoldingFilter, TestDelimitedPayloadTokenFilterFactory, TestIndonesianStemmer, TestCircumfix, EdgeNGramTokenFilterTest, TestPatternTokenizer, TestScandinavianFoldingFilter, TestIgnore, TestRandomChains]
         [junit4] Completed [130/272 (1!)] on J4 in 9.85s, 2 tests, 1 error <<< FAILURES!
      
      1. LUCENE-7442.patch
        1 kB
        Cao Manh Dat
      2. LUCENE-7442.patch
        3 kB
        Cao Manh Dat

        Issue Links

          Activity

          Hide
          caomanhdat Cao Manh Dat added a comment -

          It seem that TestRandomChains init MinHashFilter with wrong parameters

          public MinHashFilter(TokenStream input, int hashCount, int bucketCount, int hashSetSize, boolean withRotation)
          

          hashCount, bucketCount, hashSetSize must be positive ones.

          Here are the patch to fix this issue.

          Show
          caomanhdat Cao Manh Dat added a comment - It seem that TestRandomChains init MinHashFilter with wrong parameters public MinHashFilter(TokenStream input, int hashCount, int bucketCount, int hashSetSize, boolean withRotation) hashCount, bucketCount, hashSetSize must be positive ones. Here are the patch to fix this issue.
          Hide
          steve_rowe Steve Rowe added a comment -

          Hi Dat, thanks for looking into it.

          The correct approach here is to have the MinHashFilter ctor throw IllegalArgumentException for invalid arguments (rather than the asserts you added in your patch) - see LengthFilter for an example.

          That way TestRandomChains.brokenConstructors doesn't need an entry for MinHashFilter, because the loop in MockRandomAnalyzer.newFilterChain() will skip over invalid ctor/args combinations (as communicated via IllegalArgumentException or UnsupportedOperationException) - see MockRandomAnalyzer.createComponent().

          (AFAICT the name "brokenConstructors" is an attempt to convey the fact that the constructors with entries don't/can't properly validate their arguments.)

          Show
          steve_rowe Steve Rowe added a comment - Hi Dat, thanks for looking into it. The correct approach here is to have the MinHashFilter ctor throw IllegalArgumentException for invalid arguments (rather than the asserts you added in your patch) - see LengthFilter for an example. That way TestRandomChains.brokenConstructors doesn't need an entry for MinHashFilter, because the loop in MockRandomAnalyzer.newFilterChain() will skip over invalid ctor/args combinations (as communicated via IllegalArgumentException or UnsupportedOperationException) - see MockRandomAnalyzer.createComponent(). (AFAICT the name "brokenConstructors" is an attempt to convey the fact that the constructors with entries don't/can't properly validate their arguments.)
          Hide
          steve_rowe Steve Rowe added a comment -

          Another failure from my Jenkins caused by MinHashFilter not validating its args:

             [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
             [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=4269222C7AE3CDA1 -Dtests.slow=true -Dtests.locale=lt -Dtests.timezone=America/Fortaleza -Dtests.asserts=true -Dtests.file.encoding=UTF-8
             [junit4] ERROR   0.84s J8 | TestRandomChains.testRandomChainsWithLargeStrings <<<
             [junit4]    > Throwable #1: java.lang.ArithmeticException: / by zero
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([4269222C7AE3CDA1:28329D3D23ADED52]:0)
             [junit4]    > 	at org.apache.lucene.analysis.minhash.MinHashFilter.<init>(MinHashFilter.java:121)
             [junit4]    > 	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
             [junit4]    > 	at org.apache.lucene.analysis.core.TestRandomChains$MockRandomAnalyzer.createComponent(TestRandomChains.java:713)
             [junit4]    > 	at org.apache.lucene.analysis.core.TestRandomChains$MockRandomAnalyzer.newFilterChain(TestRandomChains.java:823)
             [junit4]    > 	at org.apache.lucene.analysis.core.TestRandomChains$MockRandomAnalyzer.toString(TestRandomChains.java:702)
             [junit4]    > 	at java.lang.String.valueOf(String.java:2994)
             [junit4]    > 	at java.lang.StringBuilder.append(StringBuilder.java:131)
             [junit4]    > 	at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:962)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
             [junit4]   2> NOTE: test params are: codec=Asserting(Lucene62), sim=ClassicSimilarity, locale=lt, timezone=America/Fortaleza
             [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=382482496,total=519569408
             [junit4]   2> NOTE: All tests run in this JVM: [TestReverseStringFilterFactory, TestLengthFilter, TestGermanStemFilter, TestKStemmer, TestNeedAffix, TestCapitalizationFilter, TestScandinavianNormalizationFilter, TestFlagNum, TestStopAnalyzer, TestLatvianStemFilterFactory, TestGermanLightStemFilter, TestPatternTokenizerFactory, TestDanishAnalyzer, TestHindiFilters, TestUnicodeWhitespaceTokenizer, TestThaiTokenizerFactory, TestThaiAnalyzer, TestRandomChains]
             [junit4] Completed [114/272 (1!)] on J8 in 4.79s, 2 tests, 1 error <<< FAILURES!
          
          Show
          steve_rowe Steve Rowe added a comment - Another failure from my Jenkins caused by MinHashFilter not validating its args: [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=4269222C7AE3CDA1 -Dtests.slow=true -Dtests.locale=lt -Dtests.timezone=America/Fortaleza -Dtests.asserts=true -Dtests.file.encoding=UTF-8 [junit4] ERROR 0.84s J8 | TestRandomChains.testRandomChainsWithLargeStrings <<< [junit4] > Throwable #1: java.lang.ArithmeticException: / by zero [junit4] > at __randomizedtesting.SeedInfo.seed([4269222C7AE3CDA1:28329D3D23ADED52]:0) [junit4] > at org.apache.lucene.analysis.minhash.MinHashFilter.<init>(MinHashFilter.java:121) [junit4] > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) [junit4] > at org.apache.lucene.analysis.core.TestRandomChains$MockRandomAnalyzer.createComponent(TestRandomChains.java:713) [junit4] > at org.apache.lucene.analysis.core.TestRandomChains$MockRandomAnalyzer.newFilterChain(TestRandomChains.java:823) [junit4] > at org.apache.lucene.analysis.core.TestRandomChains$MockRandomAnalyzer.toString(TestRandomChains.java:702) [junit4] > at java.lang.String.valueOf(String.java:2994) [junit4] > at java.lang.StringBuilder.append(StringBuilder.java:131) [junit4] > at org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:962) [junit4] > at java.lang.Thread.run(Thread.java:745) [junit4] 2> NOTE: test params are: codec=Asserting(Lucene62), sim=ClassicSimilarity, locale=lt, timezone=America/Fortaleza [junit4] 2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=382482496,total=519569408 [junit4] 2> NOTE: All tests run in this JVM: [TestReverseStringFilterFactory, TestLengthFilter, TestGermanStemFilter, TestKStemmer, TestNeedAffix, TestCapitalizationFilter, TestScandinavianNormalizationFilter, TestFlagNum, TestStopAnalyzer, TestLatvianStemFilterFactory, TestGermanLightStemFilter, TestPatternTokenizerFactory, TestDanishAnalyzer, TestHindiFilters, TestUnicodeWhitespaceTokenizer, TestThaiTokenizerFactory, TestThaiAnalyzer, TestRandomChains] [junit4] Completed [114/272 (1!)] on J8 in 4.79s, 2 tests, 1 error <<< FAILURES!
          Hide
          caomanhdat Cao Manh Dat added a comment -

          Hi Steve Rowe, that's a great approach! This is the patch that fix the issue in that way.

          Show
          caomanhdat Cao Manh Dat added a comment - Hi Steve Rowe , that's a great approach! This is the patch that fix the issue in that way.
          Hide
          steve_rowe Steve Rowe added a comment -

          Thanks Cao Manh Dat, both seeds pass with your latest patch. I'll commit now.

          Show
          steve_rowe Steve Rowe added a comment - Thanks Cao Manh Dat , both seeds pass with your latest patch. I'll commit now.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 109ec23426d6d42c7cefd10ad96a56ca504e6a9a in lucene-solr's branch refs/heads/branch_6_2 from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=109ec23 ]

          LUCENE-7442: MinHashFilter's ctor should validate its args

          Show
          jira-bot ASF subversion and git services added a comment - Commit 109ec23426d6d42c7cefd10ad96a56ca504e6a9a in lucene-solr's branch refs/heads/branch_6_2 from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=109ec23 ] LUCENE-7442 : MinHashFilter's ctor should validate its args
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 8066a3605ccf4b91ece20810fd435f1b5c6da44f in lucene-solr's branch refs/heads/branch_6_2 from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8066a36 ]

          LUCENE-7442: add changes entry for 6.2.1

          Show
          jira-bot ASF subversion and git services added a comment - Commit 8066a3605ccf4b91ece20810fd435f1b5c6da44f in lucene-solr's branch refs/heads/branch_6_2 from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8066a36 ] LUCENE-7442 : add changes entry for 6.2.1
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 6fb22fcf55ce2883f45da285ee97a05e7a832579 in lucene-solr's branch refs/heads/branch_6x from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6fb22fc ]

          LUCENE-7442: MinHashFilter's ctor should validate its args

          Show
          jira-bot ASF subversion and git services added a comment - Commit 6fb22fcf55ce2883f45da285ee97a05e7a832579 in lucene-solr's branch refs/heads/branch_6x from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6fb22fc ] LUCENE-7442 : MinHashFilter's ctor should validate its args
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 36362a2a69a30918d1f6670af208a0801909304f in lucene-solr's branch refs/heads/master from Steve Rowe
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=36362a2 ]

          LUCENE-7442: MinHashFilter's ctor should validate its args

          Show
          jira-bot ASF subversion and git services added a comment - Commit 36362a2a69a30918d1f6670af208a0801909304f in lucene-solr's branch refs/heads/master from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=36362a2 ] LUCENE-7442 : MinHashFilter's ctor should validate its args
          Hide
          steve_rowe Steve Rowe added a comment -

          Committed to master, branch_6x and branch_6_2 (for inclusion in the 6.2.1 release). Thanks Dat!

          Show
          steve_rowe Steve Rowe added a comment - Committed to master, branch_6x and branch_6_2 (for inclusion in the 6.2.1 release). Thanks Dat!
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          Closing after 6.2.1 release

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - Closing after 6.2.1 release

            People

            • Assignee:
              steve_rowe Steve Rowe
              Reporter:
              steve_rowe Steve Rowe
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development