[LUCENE-10360] BeiderMorseFilter: TestRandomChains fails with IndexOutOfBounds on empty term text - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: modules/analysis
Labels:
- random-chains

Lucene Fields: New

Description

Error seen:

  2> TEST FAIL: useCharFilter=true text='Uf?F ?wlu{0 <!--'a'
  2> Exception from random analyzer:
  2> charfilters=
  2> tokenizer=
  2>   org.apache.lucene.analysis.ja.JapaneseTokenizer(org.apache.lucene.util.AttributeFactory$1@4c00d592, null, false, true, NORMAL)
  2> filters=
  2>   Conditional:org.apache.lucene.analysis.pt.PortugueseLightStemFilter(OneTimeWrapper@3fad923e term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,baseForm=null,partOfSpeech=null,partOfSpeech (en)=null,reading=null,reading (en)=null,pronunciation=null,pronunciation (en)=null,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false)
  2>   org.apache.lucene.analysis.phonetic.BeiderMorseFilter(ValidatingTokenFilter@43fbbeb0 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,baseForm=null,partOfSpeech=null,partOfSpeech (en)=null,reading=null,reading (en)=null,pronunciation=null,pronunciation (en)=null,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false, org.apache.commons.codec.language.bm.PhoneticEngine@631e916d)
  2>   Conditional:org.apache.lucene.analysis.synonym.SynonymGraphFilter(OneTimeWrapper@77051976 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,flags=0,payload=null,baseForm=null,partOfSpeech=null,partOfSpeech (en)=null,reading=null,reading (en)=null,pronunciation=null,pronunciation (en)=null,inflectionType=null,inflectionType (en)=null,inflectionForm=null,inflectionForm (en)=null,keyword=false, org.apache.lucene.analysis.synonym.SynonymMap@69152718, true)
   >     java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 0
   >         at __randomizedtesting.SeedInfo.seed([1E22B4EE8663AE48:23C39D8FC171B388]:0)
   >         at org.apache.commons.codec@1.13/org.apache.commons.codec.language.bm.PhoneticEngine.encode(PhoneticEngine.java:433)
   >         at org.apache.commons.codec@1.13/org.apache.commons.codec.language.bm.PhoneticEngine.encode(PhoneticEngine.java:384)
   >         at org.apache.lucene.analysis.phonetic@10.0.0-SNAPSHOT/org.apache.lucene.analysis.phonetic.BeiderMorseFilter.incrementToken(BeiderMorseFilter.java:96)

The issue actually occurs when:

- PhoneticEngine uses NameType=SEPHARDIC
- The term is empty, or becomes empty after the cleanup done by encode() (whitespace and dashes removed)

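The problem is that the encoder calls String.split() and assumes the resulting array always has at least one element. When the array is empty, accessing its last element fails with "Index -1 out of bounds for length 0", as seen in the stack trace.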
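The String.split() behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the Commons Codec code: the class name, the method names, and the apostrophe delimiter are illustrative assumptions, chosen only to show how a split can return a zero-length array and how a length check avoids the exception.

```java
final class SplitSketch {

    // Unguarded pattern (hypothetical): assumes split() returns at least one element.
    static String lastPartUnguarded(String word) {
        String[] parts = word.split("'");
        // Throws ArrayIndexOutOfBoundsException when parts.length == 0.
        return parts[parts.length - 1];
    }

    // Guarded variant (hypothetical): falls back to an empty string for an empty array.
    static String lastPartGuarded(String word) {
        String[] parts = word.split("'");
        return parts.length == 0 ? "" : parts[parts.length - 1];
    }

    public static void main(String[] args) {
        // split() drops trailing empty strings, so a word made up only of
        // the delimiter yields a zero-length array.
        System.out.println("'".split("'").length);

        // An empty word contains no delimiter, so split() returns the word
        // itself as a single element.
        System.out.println("".split("'").length);

        System.out.println(lastPartGuarded("o'brien"));

        try {
            lastPartUnguarded("'");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unguarded access failed: " + e.getMessage());
        }
    }
}
```

Whether an empty string is the right fallback for the encoder is a design question for the upstream fix; on the Lucene side, skipping such terms before they reach the encoder would be another option.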

A test for this is easy to write, but the bug itself is in Apache Commons Codec (PhoneticEngine.encode) and has to be reported upstream.

People

Assignee: Unassigned

Reporter: Uwe Schindler

Votes: 0

Watchers: 0

Dates

Created: 05/Jan/22 12:11

Updated: 28/Aug/22 16:34