Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
We now have common dictionary interfaces for kuromoji and nori (LUCENE-10393). A natural follow-up question: is it possible to unify the Japanese and Korean tokenizers?
The core methods of the two tokenizers are `parse()` and `backtrace()`, which compute the minimum-cost path by Viterbi search. The goal of this issue is to factor them out into a separate class (in analysis-common) shared by JapaneseTokenizer and KoreanTokenizer.
The minimum-cost path algorithm itself is of course language-agnostic, so this should be theoretically possible; the most difficult part may be the N-best path calculation, which is supported only by JapaneseTokenizer and not by KoreanTokenizer.
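To illustrate why the search is language-agnostic, here is a minimal sketch of Viterbi minimum-cost segmentation over a token lattice. All names (`ViterbiSketch`, `Candidate`, `bestPath`) are illustrative inventions, not Lucene's actual API, and the sketch omits connection costs and unknown-word handling that the real tokenizers require.

```java
import java.util.*;

// Minimal, language-agnostic sketch of Viterbi minimum-cost segmentation.
// Names here are illustrative; they are not Lucene's actual API.
public class ViterbiSketch {
    // A candidate token spanning [start, end) of the input, with a word cost.
    record Candidate(int start, int end, String surface, int cost) {}

    // Forward pass ("parse"): dynamic programming over end positions.
    // Backward pass ("backtrace"): recover the arcs of the cheapest path.
    // Assumes every position is reachable through some candidate.
    static List<String> bestPath(int length, List<Candidate> candidates) {
        int[] best = new int[length + 1];          // cheapest cost to reach each position
        Candidate[] back = new Candidate[length + 1]; // arc used to reach each position
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;
        for (int pos = 0; pos < length; pos++) {
            if (best[pos] == Integer.MAX_VALUE) continue; // position not reachable
            for (Candidate c : candidates) {
                if (c.start() != pos) continue;
                // Real analyzers also add a connection cost between adjacent tokens.
                int total = best[pos] + c.cost();
                if (total < best[c.end()]) {
                    best[c.end()] = total;
                    back[c.end()] = c;
                }
            }
        }
        // Backtrace from the end of the input to the start.
        Deque<String> path = new ArrayDeque<>();
        for (int pos = length; pos > 0; pos = back[pos].start()) {
            path.addFirst(back[pos].surface());
        }
        return new ArrayList<>(path);
    }

    public static void main(String[] args) {
        // Toy lattice over a 4-character input "abcd".
        List<Candidate> lattice = List.of(
            new Candidate(0, 2, "ab", 5),
            new Candidate(0, 1, "a", 3),
            new Candidate(1, 2, "b", 4),
            new Candidate(2, 4, "cd", 2),
            new Candidate(2, 3, "c", 1),
            new Candidate(3, 4, "d", 6));
        System.out.println(bestPath(4, lattice)); // prints [ab, cd]
    }
}
```

Nothing in the loop above depends on whether the dictionary is Japanese or Korean; only the lattice construction (dictionary lookup, character classes, unknown-word rules) and the N-best extension differ between the two tokenizers.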
Issue Links
1. Unify "Token" interface in Kuromoji and Nori (Resolved, Tomoko Uchida)