Lucene - Core / LUCENE-10360

BeiderMorseFilter: TestRandomChains fails with IndexOutOfBounds on empty term text

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Lucene Fields: New

    Description

      Error seen:

        2> TEST FAIL: useCharFilter=true text='Uf?F ?wlu{0 <!--'a'
        2> Exception from random analyzer:
        2> charfilters=
        2> tokenizer=
        2>   org.apache.lucene.analysis.ja.JapaneseTokenizer(org.apache.lucene.util.AttributeFactory$1@4c00d592, null, false, true, NORMAL)
        2> filters=
        2>   Conditional:org.apache.lucene.analysis.pt.PortugueseLightStemFilter(OneTimeWrapper@3fad923e term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,baseForm=null,partOfSpeech=null,partOfSpeech (en)=null,reading=null,reading (en)=null,pronunciation=null,pronunciation (en)=null,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false)
        2>   org.apache.lucene.analysis.phonetic.BeiderMorseFilter(ValidatingTokenFilter@43fbbeb0 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,baseForm=null,partOfSpeech=null,partOfSpeech (en)=null,reading=null,reading (en)=null,pronunciation=null,pronunciation (en)=null,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false, org.apache.commons.codec.language.bm.PhoneticEngine@631e916d)
        2>   Conditional:org.apache.lucene.analysis.synonym.SynonymGraphFilter(OneTimeWrapper@77051976 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,baseForm=null,partOfSpeech=null,partOfSpeech (en)=null,reading=null,reading (en)=null,pronunciation=null,pronunciation (en)=null,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false, org.apache.lucene.analysis.synonym.SynonymMap@69152718, true)
         >     java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 0
         >         at __randomizedtesting.SeedInfo.seed([1E22B4EE8663AE48:23C39D8FC171B388]:0)
         >         at org.apache.commons.codec@1.13/org.apache.commons.codec.language.bm.PhoneticEngine.encode(PhoneticEngine.java:433)
         >         at org.apache.commons.codec@1.13/org.apache.commons.codec.language.bm.PhoneticEngine.encode(PhoneticEngine.java:384)
         >         at org.apache.lucene.analysis.phonetic@10.0.0-SNAPSHOT/org.apache.lucene.analysis.phonetic.BeiderMorseFilter.incrementToken(BeiderMorseFilter.java:96)
      

      The failure occurs when both of the following hold:

      • PhoneticEngine uses NameType=SEPHARDIC
      • The term is empty, or becomes empty after the cleanup done in encode() (whitespace and dashes removed)

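      The root cause is that the encoder calls String.split() and assumes the resulting array always has at least one element. When the array is empty, indexing its last element uses index -1, which gives the ArrayIndexOutOfBoundsException shown above.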
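
      The split() pitfall can be shown in plain Java, without commons-codec. This is a minimal sketch, not the PhoneticEngine code: the class SplitGuardDemo and the helper lastPartOrEmpty are hypothetical names used only for illustration. Note that an empty string yields a one-element array, while a string made up only of delimiters yields a zero-length array; which exact inputs reach that state inside PhoneticEngine is not verified here.

```java
class SplitGuardDemo {

    // Hypothetical helper (not part of commons-codec or Lucene):
    // returns the last piece of `word` after splitting on `delimiterRegex`,
    // or "" when split() returns a zero-length array.
    static String lastPartOrEmpty(String word, String delimiterRegex) {
        String[] parts = word.split(delimiterRegex);
        if (parts.length == 0) {
            // Without this guard, parts[parts.length - 1] would be parts[-1].
            return "";
        }
        return parts[parts.length - 1];
    }

    public static void main(String[] args) {
        // No delimiter found: split() returns the input as a single element.
        System.out.println("".split("'").length);      // 1

        // Only delimiters: all resulting pieces are empty and get dropped.
        System.out.println("'".split("'").length);     // 0
        System.out.println("   ".split("\\s+").length); // 0

        // Unguarded access reproduces the exception type from the report.
        try {
            String[] parts = "'".split("'");
            System.out.println(parts[parts.length - 1]);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("caught: " + e.getMessage());
        }

        // Guarded access returns a usable value instead.
        System.out.println("[" + lastPartOrEmpty("'", "'") + "]");  // []
        System.out.println(lastPartOrEmpty("o'brien", "'"));        // brien
    }
}
```

      A guard of this kind in the caller would only hide the symptom; the assumption about the array length sits in the encoder itself.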

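      A test that reproduces this is easy to write, but the bug itself has to be reported upstream, because the failing code is in Apache Commons Codec (org.apache.commons.codec.language.bm.PhoneticEngine), not in Lucene.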

          People

            Assignee: Unassigned
            Reporter: Uwe Schindler (uschindler)