Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.5.1, 4.6, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      One of EdgeNGramTokenizer, ShingleFilter, NGramTokenFilter is buggy, or possibly only the combination of them conspiring together.

      1. LUCENE-5269_test.patch
        5 kB
        Robert Muir
      2. LUCENE-5269_test.patch
        5 kB
        Robert Muir
      3. LUCENE-5269_test.patch
        4 kB
        Robert Muir
      4. LUCENE-5269.patch
        24 kB
        Robert Muir
      5. LUCENE-5269.patch
        9 kB
        Robert Muir
      6. LUCENE-5269.patch
        6 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        Here's a test. For whatever reason, the exact text from Jenkins wouldn't reproduce with checkAnalysisConsistency with the exact configuration.

        However, the random seed reproduces easily in Jenkins. I suspect maybe something is not being reset and the linedocs file is triggering it?

        If I blast random data at the configuration, it fails the same way.

        I then removed various harmless filters and so on until I was left with these three, and it was still failing.

        Robert Muir added a comment -

        Extremely noisy version of the same test.

        Robert Muir added a comment -

        With SopFilter 2.0.

        Robert Muir added a comment -

        Now I see stuff like this:

        EdgeNGramTokenizer.reset()
        ShingleFilter.reset()
        NGramTokenFilter.reset()
        EdgeNGramTokenizer->term=β₯ž ,bytes=[e2 a5 9e 20],positionIncrement=1,positionLength=1,startOffset=0,endOffset=2,type=word,clearCalled=true
        EdgeNGramTokenizer->term=β₯ž 𐀋,bytes=[e2 a5 9e 20 f0 90 a4 8b],positionIncrement=1,positionLength=1,startOffset=0,endOffset=4,type=word,clearCalled=true
        EdgeNGramTokenizer->term=β₯ž π€‹π€Ÿ,bytes=[e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f],positionIncrement=1,positionLength=1,startOffset=0,endOffset=6,type=word,clearCalled=true
        EdgeNGramTokenizer->term=β₯ž π€‹π€Ÿ ,bytes=[e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20],positionIncrement=1,positionLength=1,startOffset=0,endOffset=7,type=word,clearCalled=true
        EdgeNGramTokenizer->term=β₯ž π€‹π€Ÿ x,bytes=[e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78],positionIncrement=1,positionLength=1,startOffset=0,endOffset=8,type=word,clearCalled=true
        EdgeNGramTokenizer->term=β₯ž π€‹π€Ÿ xq,bytes=[e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71],positionIncrement=1,positionLength=1,startOffset=0,endOffset=9,type=word,clearCalled=true
        EdgeNGramTokenizer->term=β₯ž π€‹π€Ÿ xqx,bytes=[e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 78],positionIncrement=1,positionLength=1,startOffset=0,endOffset=10,type=word,clearCalled=true
        EdgeNGramTokenizer->term=β₯ž π€‹π€Ÿ xqxp,bytes=[e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 78 70],positionIncrement=1,positionLength=1,startOffset=0,endOffset=11,type=word,clearCalled=true
        EdgeNGramTokenizer->term=β₯ž π€‹π€Ÿ xqxp ,bytes=[e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 78 70 20],positionIncrement=1,positionLength=1,startOffset=0,endOffset=12,type=word,clearCalled=true
        EdgeNGramTokenizer->term=β₯ž π€‹π€Ÿ xqxp ,bytes=[e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 78 70 20 16],positionIncrement=1,positionLength=1,startOffset=0,endOffset=13,type=word,clearCalled=true
        EdgeNGramTokenizer.end()
        ShingleFilter->term=β₯ž ,bytes=[e2 a5 9e 20],positionIncrement=1,positionLength=1,startOffset=0,endOffset=2,type=word,clearCalled=true
        ShingleFilter->term=β₯ž  β₯ž 𐀋,bytes=[e2 a5 9e 20 20 e2 a5 9e 20 f0 90 a4 8b],positionIncrement=0,positionLength=2,startOffset=0,endOffset=4,type=shingle,clearCalled=true
        ShingleFilter->term=β₯ž  β₯ž 𐀋 β₯ž π€‹π€Ÿ,bytes=[e2 a5 9e 20 20 e2 a5 9e 20 f0 90 a4 8b 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f],positionIncrement=0,positionLength=3,startOffset=0,endOffset=6,type=shingle,clearCalled=true
        ShingleFilter->term=β₯ž  β₯ž 𐀋 β₯ž π€‹π€Ÿ β₯ž π€‹π€Ÿ ,bytes=[e2 a5 9e 20 20 e2 a5 9e 20 f0 90 a4 8b 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20],positionIncrement=0,positionLength=4,startOffset=0,endOffset=7,type=shingle,clearCalled=true
        ShingleFilter->term=β₯ž  β₯ž 𐀋 β₯ž π€‹π€Ÿ β₯ž π€‹π€Ÿ  β₯ž π€‹π€Ÿ x,bytes=[e2 a5 9e 20 20 e2 a5 9e 20 f0 90 a4 8b 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78],positionIncrement=0,positionLength=5,startOffset=0,endOffset=8,type=shingle,clearCalled=true
        ShingleFilter->term=β₯ž  β₯ž 𐀋 β₯ž π€‹π€Ÿ β₯ž π€‹π€Ÿ  β₯ž π€‹π€Ÿ x β₯ž π€‹π€Ÿ xq,bytes=[e2 a5 9e 20 20 e2 a5 9e 20 f0 90 a4 8b 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71],positionIncrement=0,positionLength=6,startOffset=0,endOffset=9,type=shingle,clearCalled=true
        ShingleFilter->term=β₯ž  β₯ž 𐀋 β₯ž π€‹π€Ÿ β₯ž π€‹π€Ÿ  β₯ž π€‹π€Ÿ x β₯ž π€‹π€Ÿ xq β₯ž π€‹π€Ÿ xqx,bytes=[e2 a5 9e 20 20 e2 a5 9e 20 f0 90 a4 8b 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 78],positionIncrement=0,positionLength=7,startOffset=0,endOffset=10,type=shingle,clearCalled=true
        ShingleFilter->term=β₯ž  β₯ž 𐀋 β₯ž π€‹π€Ÿ β₯ž π€‹π€Ÿ  β₯ž π€‹π€Ÿ x β₯ž π€‹π€Ÿ xq β₯ž π€‹π€Ÿ xqx β₯ž π€‹π€Ÿ xqxp,bytes=[e2 a5 9e 20 20 e2 a5 9e 20 f0 90 a4 8b 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 78 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 78 70],positionIncrement=0,positionLength=8,startOffset=0,endOffset=11,type=shingle,clearCalled=true
        ShingleFilter->term=β₯ž  β₯ž 𐀋 β₯ž π€‹π€Ÿ β₯ž π€‹π€Ÿ  β₯ž π€‹π€Ÿ x β₯ž π€‹π€Ÿ xq β₯ž π€‹π€Ÿ xqx β₯ž π€‹π€Ÿ xqxp β₯ž π€‹π€Ÿ xqxp ,bytes=[e2 a5 9e 20 20 e2 a5 9e 20 f0 90 a4 8b 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 78 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 78 70 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 78 70 20],positionIncrement=0,positionLength=9,startOffset=0,endOffset=12,type=shingle,clearCalled=true
        NGramTokenFilter->term=β₯ž  β₯ž 𐀋 β₯ž π€‹π€Ÿ β₯ž π€‹π€Ÿ  β₯ž π€‹π€Ÿ x β₯ž π€‹π€Ÿ xq β₯ž π€‹π€Ÿ xqx β₯ž π€‹π€Ÿ xqxp β₯ž 𐀋,bytes=[e2 a5 9e 20 20 e2 a5 9e 20 f0 90 a4 8b 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 78 20 e2 a5 9e 20 f0 90 a4 8b f0 90 a4 9f 20 78 71 78 70 20 e2 a5 9e 20 f0 90 a4 8b],positionIncrement=0,positionLength=9,startOffset=0,endOffset=12,type=word,clearCalled=true
        TEST FAIL: useCharFilter=false text='\u295e \u1090b\u1091f xqxp \u0016'
        
        java.lang.AssertionError: first posIncrement must be >= 1
        	at __randomizedtesting.SeedInfo.seed([6CC8BD35A010E1FF:714032FA1B8FBB60]:0)
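The failing assertion reflects the token stream contract: the first token emitted after reset() must advance the position, i.e. have a position increment of at least 1; only later, stacked tokens may use 0. As a hedged illustration (the Token record and check method here are hypothetical, not Lucene's actual BaseTokenStreamTestCase code), the invariant can be sketched with the stdlib alone:

```java
import java.util.List;

// Minimal sketch (hypothetical, not Lucene code) of the invariant that the
// consistency checks enforce: the first token after reset() must have
// positionIncrement >= 1.
public class FirstPosIncCheck {
    record Token(String term, int posInc) {}

    static void check(List<Token> stream) {
        if (!stream.isEmpty() && stream.get(0).posInc() < 1) {
            throw new AssertionError("first posIncrement must be >= 1");
        }
    }

    public static void main(String[] args) {
        // Legal: later tokens may stack at the same position (posInc=0)...
        check(List.of(new Token("wi", 1), new Token("wi fi", 0)));
        // ...but a stream must never *start* with posInc=0:
        try {
            check(List.of(new Token("wi fi", 0)));
        } catch (AssertionError expected) {
            System.out.println("caught: " + expected.getMessage());
        }
    }
}
```

In the trace above, the very first token NGramTokenFilter emits carries positionIncrement=0, which is exactly what trips this check.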
        
        Robert Muir added a comment -

        Mike spotted the bug. Here is a hack patch.

        I will add the optimization, tests, and a factory.

        Robert Muir added a comment -

        With 'svn add'.

        Robert Muir added a comment -

        Cleaned up patch.

        I also tried to enhance the n-gram tests in general (these filters had offset checks disabled, always hardcoded certain parameters, etc.).

        @jpountz was this intentional? Can you review if you get a chance?

        Adrien Grand added a comment -

        Good catch. This was definitely not intentional, thanks for fixing those tests!

        Patch looks good to me!

        ASF subversion and git services added a comment -

        Commit 1531186 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1531186 ]

        LUCENE-5269: Fix NGramTokenFilter length filtering

        Robert Muir added a comment -

        The test needs some improvement: after backporting, I ran the tests about 30 times and hit this one:

        ant test -Dtestcase=TestBugInSomething -Dtests.method=testUnicodeShinglesAndNgrams -Dtests.seed=1BFA8BADE39EDF70 -Dtests.slow=true -Dtests.locale=th_TH_TH_#u-nu-thai -Dtests.timezone=Europe/Copenhagen -Dtests.file.encoding=US-ASCII

           [junit4] Suite: org.apache.lucene.analysis.core.TestBugInSomething
           [junit4]   2> TEST FAIL: useCharFilter=true text='ike to thank the rap'
           [junit4]   2> ?.?. ??, ???? ?:??:?? ?????????? com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
           [junit4]   2> WARNING: Uncaught exception in thread: Thread[Thread-2,5,TGRP-TestBugInSomething]
           [junit4]   2> java.lang.OutOfMemoryError: GC overhead limit exceeded
           [junit4]   2> 	at __randomizedtesting.SeedInfo.seed([1BFA8BADE39EDF70]:0)
           [junit4]   2> 	at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.toString(CharTermAttributeImpl.java:269)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:696)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:605)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:57)
           [junit4]   2> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:476)
           [junit4]   2> 
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestBugInSomething -Dtests.method=testUnicodeShinglesAndNgrams -Dtests.seed=1BFA8BADE39EDF70 -Dtests.slow=true -Dtests.locale=th_TH_TH_#u-nu-thai -Dtests.timezone=Europe/Copenhagen -Dtests.file.encoding=US-ASCII
           [junit4] ERROR   30.6s | TestBugInSomething.testUnicodeShinglesAndNgrams <<<
           [junit4]    > Throwable #1: java.lang.RuntimeException: some thread(s) failed
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:526)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:428)
           [junit4]    > 	at org.apache.lucene.analysis.core.TestBugInSomething.testUnicodeShinglesAndNgrams(TestBugInSomething.java:255)
           [junit4]    > 	at java.lang.Thread.run(Thread.java:724)Throwable #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=12, name=Thread-2, state=RUNNABLE, group=TGRP-TestBugInSomething]
           [junit4]    > Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([1BFA8BADE39EDF70]:0)
           [junit4]    > 	at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.toString(CharTermAttributeImpl.java:269)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:696)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:605)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.access$000(BaseTokenStreamTestCase.java:57)
           [junit4]    > 	at org.apache.lucene.analysis.BaseTokenStreamTestCase$AnalysisThread.run(BaseTokenStreamTestCase.java:476)
           [junit4]   2> NOTE: test params are: codec=DummyCompressingStoredFields(storedFieldsFormat=CompressingStoredFieldsFormat(compressionMode=DUMMY, chunkSize=313), termVectorsFormat=CompressingTermVectorsFormat(compressionMode=DUMMY, chunkSize=313)), sim=RandomSimilarityProvider(queryNorm=true,coord=crazy): {}, locale=th_TH_TH_#u-nu-thai, timezone=Europe/Copenhagen
           [junit4]   2> NOTE: Linux 3.5.0-27-generic amd64/Oracle Corporation 1.7.0_25 (64-bit)/cpus=8,threads=1,free=155107808,total=477233152
           [junit4]   2> NOTE: All tests run in this JVM: [TestBugInSomething]
           [junit4] Completed in 30.92s, 1 test, 1 error <<< FAILURES!
           [junit4] 
           [junit4] 
           [junit4] Tests with failures:
           [junit4]   - org.apache.lucene.analysis.core.TestBugInSomething.testUnicodeShinglesAndNgrams
        

        I will see if I can make a less ridiculous version of the test that still fails with the bug.

        ASF subversion and git services added a comment -

        Commit 1531193 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1531193 ]

        LUCENE-5269: make test use less RAM

        ASF subversion and git services added a comment -

        Commit 1531195 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1531195 ]

        LUCENE-5269: Fix NGramTokenFilter length filtering

        ASF subversion and git services added a comment -

        Commit 1531202 from Robert Muir in branch 'dev/branches/lucene_solr_4_5'
        [ https://svn.apache.org/r1531202 ]

        LUCENE-5269: Fix NGramTokenFilter length filtering

        Uwe Schindler added a comment -

        This is so crazy! Why did we never hit this combination before?

        Thanks for fixing, although I see CodePointLengthFilter not really as a bug fix; it is more a new feature! Maybe explicitly add this as a "new feature" to CHANGES.txt?

        Robert Muir added a comment -

        I didn't really want new features mixed with bug fixes.

        But in my opinion this was the simplest way to solve the problem: just add a filter like this and have the n-gram filter use it instead of LengthFilter.

        I think it would be weird to see "new features" in a 4.5.1?

        Robert Muir added a comment -

        This is so crazy! Why did we never hit this combination before?

        This combination is especially good at finding the bug; here's why:

        Tokenizer tokenizer = new EdgeNGramTokenizer(TEST_VERSION_CURRENT, reader, 2, 94);
        TokenStream stream = new ShingleFilter(tokenizer, 5);
        stream = new NGramTokenFilter(TEST_VERSION_CURRENT, stream, 55, 83);
        

        The edge n-gram tokenizer has min=2, max=94, so it is basically brute-forcing every token size.
        Then ShingleFilter makes tons of tokens with positionIncrement=0.
        That makes it easy for the previously buggy NGramTokenFilter (with the wrong length filter) to misclassify tokens with its logic expecting code points, and emit an initial token with posInc=0:

        if ((curPos + curGramSize) <= curCodePointCount) {
        ...
                  posIncAtt.setPositionIncrement(curPosInc);
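The underlying mismatch, which the CodePointLengthFilter mentioned in later comments addresses, is that a char-based LengthFilter counts UTF-16 code units while the gram logic above counts code points; for inputs containing supplementary characters such as U+1090B the two counts disagree. A small stdlib-only demonstration (not Lucene code), using the characters from the failing input:

```java
public class CodePointsVsChars {
    public static void main(String[] args) {
        // From the failing input: U+295E, space, U+1090B.
        // U+1090B is outside the BMP, so it takes two UTF-16 chars
        // (the surrogate pair \ud802\udd0b).
        String token = "\u295e \ud802\udd0b";
        System.out.println(token.length());                          // 4 code units
        System.out.println(token.codePointCount(0, token.length())); // 3 code points
    }
}
```

A length filter comparing token.length() against min/max gram sizes therefore classifies such tokens differently than logic counting code points, which is consistent with the trace above where NGramTokenFilter's first emitted token carries posInc=0.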
        
        Uwe Schindler added a comment -

        I didn't really want new features mixed with bug fixes.

        I agree! But now we have the "new feature", so I just asked to add it as a separate entry in CHANGES.txt under "New features": just the new filter, nothing more.

        ASF subversion and git services added a comment -

        Commit 1531368 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1531368 ]

        LUCENE-5269: satisfy the policeman

        ASF subversion and git services added a comment -

        Commit 1531369 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1531369 ]

        LUCENE-5269: satisfy the policeman


          People

          • Assignee:
            Unassigned
          • Reporter:
            Robert Muir
          • Votes:
            0
          • Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:

              Development