  1. Lucene - Core
  2. LUCENE-10363

JapaneseCompletionFilter messes up offsets

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      It is a TokenFilter that tries to change offsets, so of course TestRandomChains finds bugs in it:

      NOTE: reproduce with: gradlew test --tests TestRandomChains.testRandomChainsWithLargeStrings -Dtests.seed=E233A5FAC016E02 -Dtests.nightly=true -Dtests.slow=true -Dtests.locale=en-TV -Dtests.timezone=Asia/Saigon -Dtests.asserts=true -Dtests.file.encoding=UTF-8
      
      org.apache.lucene.analysis.tests.TestRandomChains > test suite's output saved to /home/rmuir/workspace/lucene/lucene/analysis/integration.tests/build/test-results/test_54/outputs/OUTPUT-org.apache.lucene.analysis.tests.TestRandomChains.txt, copied below:
        2> stage 0: lk<[1-3] +1> p<[6-7] +1> ngtoixtmldzsjz<[10-24] +1> uoq<[25-28] +1> HANGUL<[28-28] +1> o<[29-30] +1> HANGUL<[31-31] +1> VulliPHsZzn<[32-43] +1>
        2> stage 1: lk<[1-3] +1> 850000<[1-3] +0> p<[6-7] +1> 700000<[6-7] +0> ngtoixtmldzsjz<[10-24] +1> 653543<[10-24] +0> uoq<[25-28] +1> 050000<[25-28] +0> HANGUL<[28-28] +1> 565800<[28-28] +0> o<[29-30] +1> 000000<[29-30] +0> HANGUL<[31-31] +1> 565800<[31-31] +0> VulliPHsZzn<[32-43] +1> 787460<[32-43] +0>
        2> stage 2: ngtoixtmldzsjz 653543<[10-24] +0> 653543<[10-24] +1> 653543 uoq<[10-28] +0> uoq<[25-28] +1> uoq 050000<[25-28] +0> 050000<[25-28] +1> 050000 HANGUL<[25-28] +0> HANGUL<[28-28] +1> HANGUL 565800<[28-28] +0> 565800<[28-28] +1> 565800 o<[28-30] +0> o<[29-30] +1> o 000000<[29-30] +0> 000000<[29-30] +1> 000000 HANGUL<[29-31] +0> HANGUL<[31-31] +1> HANGUL 565800<[31-31] +0> 565800<[31-31] +1> 565800 VulliPHsZzn<[31-43] +0> VulliPHsZzn<[32-43] +1>
        2> last stage: ngtoixtmldzsjz<[10-24] +1> ngtoixtmldzsjz 653543<[10-24] +0> 653543<[10-24] +1> 653543 uoq<[10-28] +0> uoq<[25-28] +1> uoq 050000<[25-28] +1> 050000<[25-28] +1> 050000 HANGUL<[25-28] +1> HANGUL<[28-28] +1> HANGUL 565800<[28-28] +0> 565800<[28-28] +1> 565800 o<[28-30] +0> o<[29-30] +1> o 000000<[29-30] +0> 000000<[29-30] +1> 000000 HANGUL<[29-31] +0> HANGUL<[31-31] +1> HANGUL 565800<[31-31] +1> 565800<[31-31] +1> 565800 VulliPHsZzn<[31-43] +0>
        2> TEST FAIL: useCharFilter=true text='[lk[-.p|) ngtoixtmldzsjz uoqao aVulliPHsZzn wxsk'
        2> Exception from random analyzer:
        2> charfilters=
        2>   org.apache.lucene.analysis.pattern.PatternReplaceCharFilter(a, <HANGUL>, java.io.StringReader@5b3b54eb)
        2> tokenizer=
        2>   org.apache.lucene.analysis.classic.ClassicTokenizer(org.apache.lucene.util.AttributeFactory$1@e29311e9)
        2> filters=
        2>   org.apache.lucene.analysis.phonetic.DaitchMokotoffSoundexFilter(ValidatingTokenFilter@32a6de77 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1, true)
        2>   org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@3d044414 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1, q)
        2>   Conditional:org.apache.lucene.analysis.ja.JapaneseCompletionFilter(OneTimeWrapper@435207ec term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,reading=null,reading (en)=null,pronunciation=null,pronunciation (en)=null, INDEX)
         >     java.lang.IllegalStateException: last stage: inconsistent endOffset at pos=19: 31 vs 43; token=565800 VulliPHsZzn
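
For readers unfamiliar with this class of failure, the following is a minimal sketch of the kind of check behind an "inconsistent endOffset" error: every token that ends at the same position in the token graph is expected to report the same end offset. It is not Lucene code and it does not reproduce the failing chain above. The class `OffsetConsistencySketch`, the `Token` holder, and the two-word input "a b" are all hypothetical, and the real validation in the test framework may differ in detail.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of an offset consistency check; not Lucene code. */
final class OffsetConsistencySketch {

  /** Hypothetical stand-in for one token's term, offsets, and position attributes. */
  static final class Token {
    final String term;
    final int startOffset, endOffset, posInc, posLen;

    Token(String term, int startOffset, int endOffset, int posInc, int posLen) {
      this.term = term;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
      this.posInc = posInc;
      this.posLen = posLen;
    }
  }

  /** Returns null if the stream is consistent, else a description of the first problem. */
  static String firstInconsistency(List<Token> tokens) {
    Map<Integer, Integer> startAtPos = new HashMap<>(); // start position -> start offset
    Map<Integer, Integer> endAtPos = new HashMap<>();   // end position -> end offset
    int pos = -1;
    for (Token t : tokens) {
      pos += t.posInc;
      int endPos = pos + t.posLen;
      // All tokens starting at the same position must share a start offset.
      Integer oldStart = startAtPos.putIfAbsent(pos, t.startOffset);
      if (oldStart != null && oldStart != t.startOffset) {
        return "inconsistent startOffset at pos=" + pos + ": " + oldStart
            + " vs " + t.startOffset + "; token=" + t.term;
      }
      // All tokens ending at the same position must share an end offset.
      Integer oldEnd = endAtPos.putIfAbsent(endPos, t.endOffset);
      if (oldEnd != null && oldEnd != t.endOffset) {
        return "inconsistent endOffset at pos=" + endPos + ": " + oldEnd
            + " vs " + t.endOffset + "; token=" + t.term;
      }
    }
    return null;
  }

  /** Unigrams plus a two-word shingle over the hypothetical text "a b". */
  static List<Token> consistentExample() {
    return Arrays.asList(
        new Token("a", 0, 1, 1, 1),
        new Token("a b", 0, 3, 0, 2), // shingle spans two positions
        new Token("b", 2, 3, 1, 1));
  }

  /** Same tokens, but the shingle's position length was (hypothetically) reset to 1. */
  static List<Token> brokenExample() {
    return Arrays.asList(
        new Token("a", 0, 1, 1, 1),
        new Token("a b", 0, 3, 0, 1),
        new Token("b", 2, 3, 1, 1));
  }

  public static void main(String[] args) {
    System.out.println(firstInconsistency(consistentExample()));
    System.out.println(firstInconsistency(brokenExample()));
  }
}
```

In the second stream, "a" and "a b" both end at position 1 but report end offsets 1 and 3, so the sketch flags the shingle. A filter that rewrites offsets, position increments, or position lengths on only some of the tokens in a graph can trip this kind of check in a similar way.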
      

          People

            Assignee: Unassigned
            Reporter: Robert Muir (rcmuir)