Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
We now have common dictionary interfaces for kuromoji and nori (LUCENE-10393). A natural follow-up question: is it possible to unify the Japanese and Korean tokenizers?
The core methods of the two tokenizers are `parse()` and `backtrace()`, which compute the minimum-cost path by Viterbi search. The goal of this issue is to factor them out into a separate class (in analysis-common) shared by JapaneseTokenizer and KoreanTokenizer.
The minimum-cost path algorithm itself is of course language-agnostic, so this should be theoretically possible; the most difficult part may be the N-best path calculation, which is supported only by JapaneseTokenizer and not by KoreanTokenizer.
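To illustrate why the search is language-agnostic, here is a minimal sketch of Viterbi minimum-cost segmentation over a token lattice. All names (`ViterbiSketch`, `Candidate`, `bestPath`) are illustrative inventions, not Lucene's actual API, and the sketch omits connection costs and unknown-word handling that the real tokenizers require.

```java
import java.util.*;

// Minimal, language-agnostic sketch of Viterbi minimum-cost segmentation.
// Names here are illustrative; they are not Lucene's actual API.
public class ViterbiSketch {
    // A candidate token spanning [start, end) of the input, with a word cost.
    record Candidate(int start, int end, String surface, int cost) {}

    // Forward pass ("parse"): dynamic programming over end positions.
    // Backward pass ("backtrace"): recover the arcs of the cheapest path.
    // Assumes every position is reachable through some candidate.
    static List<String> bestPath(int length, List<Candidate> candidates) {
        int[] best = new int[length + 1];          // cheapest cost to reach each position
        Candidate[] back = new Candidate[length + 1]; // arc used to reach each position
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;
        for (int pos = 0; pos < length; pos++) {
            if (best[pos] == Integer.MAX_VALUE) continue; // position not reachable
            for (Candidate c : candidates) {
                if (c.start() != pos) continue;
                // Real analyzers also add a connection cost between adjacent tokens.
                int total = best[pos] + c.cost();
                if (total < best[c.end()]) {
                    best[c.end()] = total;
                    back[c.end()] = c;
                }
            }
        }
        // Backtrace from the end of the input to the start.
        Deque<String> path = new ArrayDeque<>();
        for (int pos = length; pos > 0; pos = back[pos].start()) {
            path.addFirst(back[pos].surface());
        }
        return new ArrayList<>(path);
    }

    public static void main(String[] args) {
        // Toy lattice over a 4-character input "abcd".
        List<Candidate> lattice = List.of(
            new Candidate(0, 2, "ab", 5),
            new Candidate(0, 1, "a", 3),
            new Candidate(1, 2, "b", 4),
            new Candidate(2, 4, "cd", 2),
            new Candidate(2, 3, "c", 1),
            new Candidate(3, 4, "d", 6));
        System.out.println(bestPath(4, lattice)); // prints [ab, cd]
    }
}
```

Nothing in the loop above depends on whether the dictionary is Japanese or Korean; only the lattice construction (dictionary lookup, character classes, unknown-word rules) and the N-best extension differ between the two tokenizers.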
Issue Links
1. Unify "Token" interface in Kuromoji and Nori (Resolved, Tomoko Uchida)