[LUCENE-8816] Decouple Kuromoji's morphological analyser and its dictionary - ASF JIRA

Details

Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: modules/analysis
Labels: None

Lucene Fields: New

Description

I was inspired by this mailing-list thread:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E

As many Japanese users already know, the default built-in dictionary bundled with Kuromoji (MeCab IPADIC) is somewhat old and has not been maintained for many years. While it has slowly become obsolete, well-maintained and/or extended dictionaries have emerged in recent years (e.g. mecab-ipadic-neologd, UniDic). Several attempts and projects in Japan have tried to make them usable with Kuromoji.

However, the current architecture, in which the dictionary is bundled in the jar, is essentially incompatible with the idea of switching the system dictionary, and developers have difficulty doing so.

Traditionally, the morphological analysis engine (Viterbi logic) and the encoded dictionary (language model) have been decoupled (as in MeCab, the origin of Kuromoji, or in lucene-gosen). Decoupling them is therefore a natural idea, and I feel that it is a good time to rethink the current architecture.

This would also benefit advanced users who have customized or re-trained their own system dictionary.

Goals of this issue:

Decouple JapaneseTokenizer from the encoded system dictionary.
Implement a dynamic dictionary loading mechanism.
Provide a developer-oriented dictionary build tool.
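
To make the first two goals concrete, here is a minimal sketch of what decoupling and runtime lookup could look like. It is not Lucene's API and not a proposed patch: SystemDictionary, MapDictionary, DictionaryRegistry and SketchTokenizer are hypothetical names invented for illustration, and the word costs are arbitrary placeholder values.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical abstraction over an encoded system dictionary (not a Lucene API).
interface SystemDictionary {
    String name();
    int wordCost(String surface);
}

// Toy in-memory dictionary standing in for a separately packaged dictionary artifact.
final class MapDictionary implements SystemDictionary {
    static final int UNKNOWN_WORD_COST = 10000; // arbitrary illustrative value
    private final String name;
    private final Map<String, Integer> costs;

    MapDictionary(String name, Map<String, Integer> costs) {
        this.name = name;
        this.costs = Map.copyOf(costs);
    }

    @Override public String name() { return name; }

    @Override public int wordCost(String surface) {
        return costs.getOrDefault(surface, UNKNOWN_WORD_COST);
    }
}

// Hypothetical registry that resolves a dictionary by name at runtime.
final class DictionaryRegistry {
    private static final Map<String, Supplier<SystemDictionary>> PROVIDERS =
        new ConcurrentHashMap<>();

    private DictionaryRegistry() {}

    static void register(String name, Supplier<SystemDictionary> provider) {
        PROVIDERS.put(name, provider);
    }

    static SystemDictionary load(String name) {
        Supplier<SystemDictionary> provider = PROVIDERS.get(name);
        if (provider == null) {
            throw new IllegalArgumentException("No system dictionary registered as: " + name);
        }
        return provider.get();
    }
}

// Stand-in for the analysis engine: it depends only on the interface,
// never on a concrete bundled dictionary.
final class SketchTokenizer {
    private final SystemDictionary dictionary;

    SketchTokenizer(SystemDictionary dictionary) {
        this.dictionary = Objects.requireNonNull(dictionary);
    }

    String dictionaryName() { return dictionary.name(); }

    int costOf(String surface) { return dictionary.wordCost(surface); }
}

final class DictionaryDemo {
    public static void main(String[] args) {
        // "\u6771\u4eac" is the surface form for Tokyo; the costs are made up.
        DictionaryRegistry.register("ipadic",
            () -> new MapDictionary("ipadic", Map.of("\u6771\u4eac", 3003)));
        DictionaryRegistry.register("unidic",
            () -> new MapDictionary("unidic", Map.of("\u6771\u4eac", 2500)));

        // The same engine code runs against whichever dictionary is selected.
        String selected = args.length > 0 ? args[0] : "ipadic";
        SketchTokenizer tokenizer = new SketchTokenizer(DictionaryRegistry.load(selected));
        System.out.println(tokenizer.dictionaryName() + " cost: "
            + tokenizer.costOf("\u6771\u4eac"));
    }
}
```

In a real implementation the in-process registry would probably give way to a discovery mechanism such as java.util.ServiceLoader or loading from a resource path, so that a dictionary jar on the classpath can be picked up without code changes. Which mechanism fits Kuromoji is an open design question here.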

Non-goals:

Provide a learner or language model (that is up to users and is out of scope).

I have not dived into the code yet, so I have no idea at this moment whether this is easy or difficult.

Issue Links

is related to

LUCENE-8869 Build kuromoji system dictionary as a separated jar and load it from JapaneseTokenizer at runtime (Patch Available)

relates to

LUCENE-8817 Combine Nori and Kuromoji DictionaryBuilder (Patch Available)

LUCENE-4056 Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary (Patch Available)

People

Assignee: Unassigned

Reporter: Tomoko Uchida

Votes: 1

Watchers: 4

Dates

Created: 28/May/19 13:18

Updated: 28/Nov/24 21:01