Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
0.5.0
-
None
Description
tokenize_ja fails to analyze certain Japanese strings and outputs the following error:
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.lucene.analysis.ja.JapaneseTokenizer.backtrace(JapaneseTokenizer.java:1024)
at org.apache.lucene.analysis.ja.JapaneseTokenizer.parse(JapaneseTokenizer.java:873)
at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:474)
at org.apache.lucene.analysis.ja.JapaneseBaseFormFilter.incrementToken(JapaneseBaseFormFilter.java:50)
at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
at org.apache.lucene.analysis.cjk.CJKWidthFilter.incrementToken(CJKWidthFilter.java:63)
at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
at org.apache.lucene.analysis.ja.JapaneseKatakanaStemFilter.incrementToken(JapaneseKatakanaStemFilter.java:63)
at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:45)
at hivemall.nlp.tokenizer.KuromojiUDF.analyzeTokens(KuromojiUDF.java:292)
at hivemall.nlp.tokenizer.KuromojiUDF.evaluate(KuromojiUDF.java:117)
The cause is LUCENE-7279, which has already been fixed in Lucene, so Lucene needs to be upgraded.
Affected versions are not only v0.5.0 but also v0.4.2.
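Until the Lucene upgrade is in place, one possible stopgap is to catch the exception around the tokenizer call so that a single problematic string does not abort the whole job. The following is only a minimal sketch of that idea, not the fix applied in Hivemall: the class SafeTokenize and the method tokenizeOrEmpty are hypothetical names, and the tokenizer is passed in as a plain function so the example runs without Lucene on the classpath.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of Hivemall or Lucene.
final class SafeTokenize {
    private SafeTokenize() {}

    // Applies the tokenizer to the text. If it throws
    // ArrayIndexOutOfBoundsException (the symptom in the stack trace above),
    // logs the input length and returns an empty list instead of failing.
    static List<String> tokenizeOrEmpty(Function<String, List<String>> tokenizer, String text) {
        try {
            return tokenizer.apply(text);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.err.println("tokenizer failed on input of length " + text.length() + ": " + e);
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        // Stand-in tokenizer (assumption): splits on spaces, and simulates the
        // failure for one marker string instead of calling the real analyzer.
        Function<String, List<String>> standIn = s -> {
            if (s.equals("bad input")) {
                throw new ArrayIndexOutOfBoundsException(-1);
            }
            return Arrays.asList(s.split(" "));
        };
        System.out.println(tokenizeOrEmpty(standIn, "good input").size());
        System.out.println(tokenizeOrEmpty(standIn, "bad input").size());
    }
}
```

Returning an empty list silently drops the affected rows, so this only makes sense as a temporary measure. Upgrading Lucene remains the actual fix.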