Details

      Description

      Some token filters use the position length attribute of the token stream to encode the number of terms they put into a single token.
      This breaks query parsing because it creates a disconnected graph.
      I've tracked the offending cases down to two candidates:

      • ShingleFilter, which sets the position length attribute to the length of the shingle.
      • CJKBigramFilter, which always sets the position length attribute to 2.

      I don't think these filters should set the position length at all, so the best fix would be to remove the attribute from these token filters, but that could break backward compatibility.
      Still, this is a serious bug, since shingles and CJK bigrams now produce invalid queries.
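
      For anyone who wants to see the breakage directly, here is a minimal, self-contained sketch (not taken from any patch on this issue; the class name, field name, and analyzer setup are made up for illustration) that feeds a bigram-only shingle analyzer to the graph-aware QueryBuilder:

         import org.apache.lucene.analysis.Analyzer;
         import org.apache.lucene.analysis.Tokenizer;
         import org.apache.lucene.analysis.core.WhitespaceTokenizer;
         import org.apache.lucene.analysis.shingle.ShingleFilter;
         import org.apache.lucene.search.Query;
         import org.apache.lucene.util.QueryBuilder;

         public class ShingleQueryRepro {
           public static void main(String[] args) {
             // Bigram-only shingles, no unigrams: the setup described above.
             Analyzer analyzer = new Analyzer() {
               @Override
               protected TokenStreamComponents createComponents(String fieldName) {
                 Tokenizer source = new WhitespaceTokenizer();
                 ShingleFilter shingles = new ShingleFilter(source, 2, 2);
                 shingles.setOutputUnigrams(false);
                 return new TokenStreamComponents(source, shingles);
               }
             };
             // The graph-aware QueryBuilder walks the token graph produced by
             // the analyzer; with the bogus position lengths the graph is
             // disconnected and the resulting query is wrong.
             QueryBuilder builder = new QueryBuilder(analyzer);
             Query query = builder.createBooleanQuery("body", "foo bar baz");
             System.out.println(query);
           }
         }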

      Attachments

      1. LUCENE-7708.patch (6 kB) by Jim Ferenczi
      2. LUCENE-7708.patch (6 kB) by Jim Ferenczi

        Activity

        jim.ferenczi Jim Ferenczi added a comment -

        The CJKBigramFilter is actually working correctly, because it sets the position length attribute only if outputUnigrams is set.
        So only the ShingleFilter is problematic, since outputUnigrams is not checked when the position length is set.
        For instance, with shingles of size 2 the input "foo bar baz" creates two tokens, "foo bar" and "bar baz", each with a position length of 2 and a position increment of 1 in between, which forms a disconnected graph.
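        As a quick way to check this, here is a minimal sketch (assuming a plain whitespace tokenizer in front of the ShingleFilter; not part of any patch) that dumps the position increment and position length of each shingle for that input:

           import java.io.StringReader;
           import org.apache.lucene.analysis.core.WhitespaceTokenizer;
           import org.apache.lucene.analysis.shingle.ShingleFilter;
           import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
           import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
           import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

           public class ShingleGraphDump {
             public static void main(String[] args) throws Exception {
               WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
               tokenizer.setReader(new StringReader("foo bar baz"));
               ShingleFilter shingles = new ShingleFilter(tokenizer, 2, 2);
               shingles.setOutputUnigrams(false);

               CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
               PositionIncrementAttribute posInc = shingles.addAttribute(PositionIncrementAttribute.class);
               PositionLengthAttribute posLen = shingles.addAttribute(PositionLengthAttribute.class);

               shingles.reset();
               while (shingles.incrementToken()) {
                 // Without the fix, both "foo bar" and "bar baz" report posLen=2,
                 // which leaves a hole in the token graph between them.
                 System.out.println(term + " posInc=" + posInc.getPositionIncrement()
                     + " posLen=" + posLen.getPositionLength());
               }
               shingles.end();
               shingles.close();
             }
           }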
        I'll work on a patch shortly.

        jim.ferenczi Jim Ferenczi added a comment -

        Here is a patch for the ShingleFilter.
        When outputUnigrams is set to false, the position length for a shingle of size N is the number of positions created by shingles of smaller sizes: (N - minShingleSize) + 1.
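        A small sketch of that rule (illustrative only, not the actual patch code):

           public class ShinglePosLenRule {
             // With outputUnigrams=false the output positions correspond to the
             // smallest shingles, so a shingle spanning N input tokens covers
             // (N - minShingleSize) + 1 of them. With unigrams the output
             // positions stay aligned with the input, so the shingle spans N.
             static int positionLength(int shingleSize, int minShingleSize, boolean outputUnigrams) {
               return outputUnigrams ? shingleSize : (shingleSize - minShingleSize) + 1;
             }

             public static void main(String[] args) {
               // minShingleSize=2, outputUnigrams=false
               System.out.println(positionLength(2, 2, false)); // 1
               System.out.println(positionLength(3, 2, false)); // 2
             }
           }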
        Michael McCandless, can you take a look?

        steve_rowe Steve Rowe added a comment -

        +1 to the idea, but some tests are failing with the patch:

           [junit4] Tests with failures [seed: 4D8AED66905F8617]:
           [junit4]   - org.apache.lucene.analysis.shingle.ShingleFilterTest.testOutputUnigramsIfNoShinglesSingleTokenCase
           [junit4]   - org.apache.lucene.analysis.shingle.ShingleFilterTest.testOutputUnigramsIfNoShinglesWithMultipleInputTokens
           [junit4]   - org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapperTest.testOutputUnigramsIfNoShinglesSingleToken
           [junit4]   - org.apache.lucene.analysis.shingle.TestShingleFilterFactory.testOutputUnigramsIfNoShingles
        
        jim.ferenczi Jim Ferenczi added a comment -

        Thanks Steve!
        I pushed a new patch that fixes the test failures.

        steve_rowe Steve Rowe added a comment - edited

        I'm beasting 1000 iterations of TestRandomChains with the patch, and run 110 produced the following reproducing seed. Maybe it's ShingleFilter's fault? (I didn't investigate further.)

        Edit: this seed fails on unpatched master, so the patch on this issue isn't to blame. I created a separate issue: LUCENE-7711.

          [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
           [junit4]   2> TEST FAIL: useCharFilter=false text='\ufac4\u0552H \ua954\ua944 \ud0d2\uaddd\ub6cb\uc388\uc344\uca88\ud224\uc462\uaf42 g '
           [junit4]   2> Exception from random analyzer: 
           [junit4]   2> charfilters=
           [junit4]   2>   org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@3fb9d00e, [<HOST>, <HANGUL>, <IDEOGRAPHIC>, <SOUTHEAST_ASIAN>])
           [junit4]   2> tokenizer=
           [junit4]   2>   org.apache.lucene.analysis.standard.StandardTokenizer(org.apache.lucene.util.AttributeFactory$1@c893af9b)
           [junit4]   2> filters=
           [junit4]   2>   org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter(ValidatingTokenFilter@7e1e9fe2 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
           [junit4]   2>   org.apache.lucene.analysis.cjk.CJKBigramFilter(ValidatingTokenFilter@12c3fb1b term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
           [junit4]   2>   org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@31c463b5 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false, 49)
           [junit4]   2>   org.apache.lucene.analysis.in.IndicNormalizationFilter(ValidatingTokenFilter@3f72787 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
           [junit4]   2> offsetsAreCorrect=false
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=E532502212098AC7 -Dtests.slow=true -Dtests.locale=ko-KR -Dtests.timezone=Atlantic/Jan_Mayen -Dtests.asserts=true -Dtests.file.encoding=UTF-8
           [junit4] ERROR   0.76s | TestRandomChains.testRandomChains <<<
           [junit4]    > Throwable #1: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset; got startOffset=10,endOffset=9
           [junit4]    >        at __randomizedtesting.SeedInfo.seed([E532502212098AC7:D8D37943551B9707]:0)
           [junit4]    >        at org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:110)
           [junit4]    >        at org.apache.lucene.analysis.shingle.ShingleFilter.incrementToken(ShingleFilter.java:345)
           [junit4]    >        at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
           [junit4]    >        at org.apache.lucene.analysis.in.IndicNormalizationFilter.incrementToken(IndicNormalizationFilter.java:40)
           [junit4]    >        at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
           [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:731)
           [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
           [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:540)
           [junit4]    >        at org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:853)
           [junit4]    >        at java.lang.Thread.run(Thread.java:745)
           [junit4] OK      1.64s | TestRandomChains.testRandomChainsWithLargeStrings
           [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {dummy=PostingsFormat(name=LuceneVarGapFixedInterval)}, docValues:{}, maxPointsInLeafNode=542, maxMBSortInHeap=7.773738401752009, sim=RandomSimilarity(queryNorm=false): {}, locale=ko-KR, timezone=Atlantic/Jan_Mayen
           [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=400845920,total=514850816
           [junit4]   2> NOTE: All tests run in this JVM: [TestRandomChains]
           [junit4] Completed [1/1 (1!)] in 6.03s, 2 tests, 1 error <<< FAILURES!
        
        mikemccand Michael McCandless added a comment -

        +1, thanks Jim Ferenczi!

        steve_rowe Steve Rowe added a comment -

        +1, LGTM, all lucene/analysis/common/ tests pass for me with the latest patch.

        Also, 1000 beasting iterations of TestRandomChains didn't trigger any failures with this patch (other than the unrelated one at LUCENE-7711).

        jira-bot ASF subversion and git services added a comment -

        Commit 57a42e4ec54aebac40c1ef7dc93d933cd00dbe1e in lucene-solr's branch refs/heads/master from Jim Ferenczi
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=57a42e4 ]

        LUCENE-7708: Fix position length attribute set by the ShingleFilter when outputUnigrams=false

        jira-bot ASF subversion and git services added a comment -

        Commit 6c63df0b15f735907438514f3b4b702680d74588 in lucene-solr's branch refs/heads/branch_6x from Jim Ferenczi
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6c63df0 ]

        LUCENE-7708: Fix position length attribute set by the ShingleFilter when outputUnigrams=false

        jim.ferenczi Jim Ferenczi added a comment - edited

        Thanks Steve Rowe and Michael McCandless!
        elyograg Shawn Heisey added a comment -

        There's no fix version here. CHANGES.txt says it's in 6.5.0.

        (looking for possible causes of a shingle filter problem confirmed in Solr 6.3 and 6.4, this couldn't be the cause)

        jim.ferenczi Jim Ferenczi added a comment -

        Shawn Heisey, one shingle filter problem is fixed here in LUCENE-7708; it appeared in 6.3, when support for graph analysis was added to the QueryBuilder.
        The other shingle filter problem I can think of is when the number of paths is gigantic and produces an OOM. I opened LUCENE-7747 to fix that.
        For now, though, I think the workaround is to disable graph query analysis when the analyzer contains a shingle filter that produces shingles of different sizes. In that case the graph analysis builds every possible path, since each position has different side paths.
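        A hedged sketch of that workaround at the Lucene level, assuming a version where QueryBuilder exposes setEnableGraphQueries (older 6.x releases may not have it); the analyzer, field, and method names here are illustrative:

           import org.apache.lucene.analysis.Analyzer;
           import org.apache.lucene.search.BooleanClause;
           import org.apache.lucene.search.Query;
           import org.apache.lucene.util.QueryBuilder;

           public class NonGraphQueryExample {
             // shingleAnalyzer stands for whatever analyzer contains the
             // variable-size ShingleFilter; field and text are illustrative.
             static Query buildQuery(Analyzer shingleAnalyzer, String field, String text) {
               QueryBuilder builder = new QueryBuilder(shingleAnalyzer);
               // Fall back to the pre-graph query behaviour for this builder.
               builder.setEnableGraphQueries(false);
               return builder.createBooleanQuery(field, text, BooleanClause.Occur.SHOULD);
             }
           }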

        dsmiley David Smiley added a comment -

        Jim Ferenczi, what we do after committing (when everything is done) is "Resolve" the issue (not "Close"). That dialog box will give you the option to set the fix version. Later on, during the release process, there should be a JIRA step that involves bulk-closing all issues resolved for this version.

        jim.ferenczi Jim Ferenczi added a comment -

        Thanks David Smiley. I updated the status.

        elyograg Shawn Heisey added a comment -

        Looks like 6.5.0 isn't a valid version yet. Easy enough to add, but if I do so, would I be doing the right thing?

        steve_rowe Steve Rowe added a comment -

        Looks like 6.5.0 isn't a valid version yet. Easy enough to add, but if I do so, would I be doing the right thing?

        I see Jim already set the version to 6.5, but FYI Shawn Heisey, historically people have excluded the trailing ".0" in minor release labels here on JIRA.


          People

          • Assignee: Unassigned
          • Reporter: jim.ferenczi Jim Ferenczi
          • Votes: 0
          • Watchers: 6
