Lucene - Core / LUCENE-10361

KoreanNumberFilter messes up offsets

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • New

    Description

      KoreanNumberFilter is a TokenFilter that tries to change offsets, so of course TestRandomChains finds bugs in it:

      NOTE: reproduce with: gradlew test --tests TestRandomChains.testRandomChains -Dtests.seed=12BC606B774693E4 -Dtests.nightly=true -Dtests.slow=true -Dtests.locale=om-Latn-ET -Dtests.timezone=Australia/Yancowinna -Dtests.asserts=true -Dtests.file.encoding=UTF-8
      
      org.apache.lucene.analysis.tests.TestRandomChains > test suite's output saved to /home/rmuir/workspace/lucene/lucene/analysis/integration.tests/build/test-results/test_16/outputs/OUTPUT-org.apache.lucene.analysis.tests.TestRandomChains.txt, copied below:
        2> stage 0: 뱅<[0-1] +1> Ƒ<[1-2] +1> ė<[3-4] +1> 履<[6-7] +1> jEqyzUT<[8-15] +1>
        2> stage 1: 000000<[0-1] +1> Ƒ<[1-2] +1> ė<[3-4] +1> 000000<[6-7] +1> 154300<[8-15] +1> 454300<[8-15] +0>
        2> last stage: 0<[0-1] +1> Ƒ<[1-2] +1> ė<[3-4] +1> 000000<[6-7] +1> 454300<[8-15] +0>
        2> TEST FAIL: useCharFilter=false text='\ubc45\u0191(\u0117\ud8ad\udf0a\uf9df jEqyzUT '
        2> Exception from random analyzer:
        2> charfilters=
        2>   org.apache.lucene.analysis.cjk.CJKWidthCharFilter(java.io.StringReader@17af5384)
        2>   org.apache.lucene.analysis.charfilter.MappingCharFilter(org.apache.lucene.analysis.charfilter.NormalizeCharMap@33e5bdbb, org.apache.lucene.analysis.cjk.CJKWidthCharFilter@1aafd271)
        2> tokenizer=
        2>   org.apache.lucene.analysis.icu.segmentation.ICUTokenizer(org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig@4e6f4690)
        2> filters=
        2>   Conditional:org.apache.lucene.analysis.phonetic.DaitchMokotoffSoundexFilter(OneTimeWrapper@34215eb7 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,script=Common, false)
        2>   org.apache.lucene.analysis.ko.KoreanNumberFilter(ValidatingTokenFilter@7b4a2a5b term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,script=Common,keyword=false)
         >     java.lang.IllegalStateException: last stage: inconsistent startOffset at pos=3: 6 vs 8; token=454300
         >         at __randomizedtesting.SeedInfo.seed([12BC606B774693E4:2F5D490A30548E24]:0)
         >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:138)
         >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:1130)
         >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:1028)
         >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:922)
         >         at org.apache.lucene.analysis.tests@10.0.0-SNAPSHOT/org.apache.lucene.analysis.tests.TestRandomChains.testRandomChains(TestRandomChains.java:915)
      
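      The failure message comes down to one invariant: tokens that share a position must share a startOffset. Below is a minimal sketch of that check in plain Java, assuming a simplified reading of what ValidatingTokenFilter enforces. The class OffsetConsistencySketch, the Token holder, and firstConflict are hypothetical names for illustration and are not Lucene APIs. Feeding it the "last stage" tokens from the log reproduces the reported conflict: 454300 still carries a position increment of 0, but the 154300 token it was stacked on in stage 1 is gone, so it lands on the position of 000000 (startOffset 6) while keeping startOffset 8.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class OffsetConsistencySketch {

  /** Hypothetical stand-in for a token's term, offsets, and position increment. */
  static final class Token {
    final String term;
    final int startOffset;
    final int endOffset;
    final int posInc;

    Token(String term, int startOffset, int endOffset, int posInc) {
      this.term = term;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
      this.posInc = posInc;
    }
  }

  /** Returns a description of the first startOffset conflict, or empty if there is none. */
  static Optional<String> firstConflict(List<Token> tokens) {
    // Remember the first startOffset seen at each position.
    Map<Integer, Integer> posToStartOffset = new HashMap<>();
    int pos = -1; // positions start at -1 so the first +1 lands on 0
    for (Token t : tokens) {
      pos += t.posInc;
      Integer seen = posToStartOffset.putIfAbsent(pos, t.startOffset);
      if (seen != null && seen != t.startOffset) {
        return Optional.of("inconsistent startOffset at pos=" + pos + ": "
            + seen + " vs " + t.startOffset + "; token=" + t.term);
      }
    }
    return Optional.empty();
  }

  /** Tokens copied from the "stage 1" line of the log. */
  static List<Token> stage1() {
    return Arrays.asList(
        new Token("000000", 0, 1, 1),
        new Token("\u0191", 1, 2, 1),
        new Token("\u0117", 3, 4, 1),
        new Token("000000", 6, 7, 1),
        new Token("154300", 8, 15, 1),
        new Token("454300", 8, 15, 0));
  }

  /** Tokens copied from the "last stage" line of the log. */
  static List<Token> lastStage() {
    return Arrays.asList(
        new Token("0", 0, 1, 1),
        new Token("\u0191", 1, 2, 1),
        new Token("\u0117", 3, 4, 1),
        new Token("000000", 6, 7, 1),
        new Token("454300", 8, 15, 0));
  }

  public static void main(String[] args) {
    System.out.println("stage 1:    " + firstConflict(stage1()).orElse("consistent"));
    System.out.println("last stage: " + firstConflict(lastStage()).orElse("consistent"));
  }
}
```

      Under this sketch, stage 1 passes because 154300 and 454300 both start at offset 8 on the same position, while the last stage fails with the same "6 vs 8" conflict as the log. The sketch only checks startOffset; it does not check endOffset or offsets going backwards.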

          People

            Assignee: Unassigned
            Reporter: Robert Muir (rcmuir)