  Lucene - Core
  LUCENE-10049

part of speech tagging for Korean, Japanese

Details

    • Type: Improvement
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Lucene Fields: New, Patch Available

    Description

      The Korean (nori) and Japanese (kuromoji) analyzers behave the same way: both use a dictionary-based, finite-state approach to identify words (i.e., tokens).

      When analyzing Korean or Japanese input, the analyzer must perform a dictionary lookup at every character position in order to build the lattice of all possible segmentations. To make this efficient, the full vocabulary is encoded in an FST (finite state transducer), so any Korean or Japanese input can be analyzed with the Viterbi algorithm to find its most likely segmentation (the Viterbi path).
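
      For illustration, a minimal sketch of running that segmentation with nori (assuming the nori module is on the classpath; the wrapper class name and sample sentence are illustrative only):

        import java.io.StringReader;

        import org.apache.lucene.analysis.ko.KoreanTokenizer;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

        public class SegmentationDemo {
          public static void main(String[] args) throws Exception {
            // KoreanTokenizer looks up every character position in the FST-encoded
            // dictionary and emits only the best-scoring (Viterbi) segmentation.
            KoreanTokenizer tokenizer = new KoreanTokenizer();
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.setReader(new StringReader("서울대학교에 갔다"));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
              System.out.println(term.toString()); // one token per node on the Viterbi path
            }
            tokenizer.end();
            tokenizer.close();
          }
        }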

      org.apache.lucene.analysis.ko.GraphvizFormatter
      org.apache.lucene.analysis.ja.GraphvizFormatter

      These two classes can already emit Graphviz output to visualize the Viterbi lattice built from input text. However, in my experience, part-of-speech information is essential for diagnosing why the output looks the way it does, since the analysis is dictionary-based.
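
      For context, this is roughly how the lattice dump is obtained today with the kuromoji variant (a sketch; the wrapper class name is illustrative, and the nori classes are used the same way with their own ConnectionCosts). This dump is where the proposed part-of-speech labels would appear:

        import java.io.StringReader;

        import org.apache.lucene.analysis.ja.GraphvizFormatter;
        import org.apache.lucene.analysis.ja.JapaneseTokenizer;
        import org.apache.lucene.analysis.ja.dict.ConnectionCosts;

        public class LatticeDump {
          public static void main(String[] args) throws Exception {
            JapaneseTokenizer tokenizer =
                new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.NORMAL);
            // The formatter records the lattice while the tokenizer runs.
            GraphvizFormatter gv = new GraphvizFormatter(ConnectionCosts.getInstance());
            tokenizer.setGraphvizFormatter(gv);
            tokenizer.setReader(new StringReader("関西国際空港"));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
              // consume tokens so the whole input is processed
            }
            tokenizer.end();
            tokenizer.close();
            // finish() returns Graphviz "dot" source for the Viterbi lattice.
            System.out.println(gv.finish());
          }
        }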

      Adding each token's part of speech to this output will help users understand the analyzers' behavior. Although this is a very small part of the codebase, I and other users rely on these classes in our Lucene-related projects. I will open a PR after this issue is reviewed.
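
      For reference, the part of speech is already exposed per output token through the attribute API; the patch adds the same kind of information to the Graphviz lattice output. A minimal sketch using kuromoji (nori exposes an analogous PartOfSpeechAttribute; the wrapper class name is illustrative):

        import java.io.StringReader;

        import org.apache.lucene.analysis.ja.JapaneseTokenizer;
        import org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

        public class PosDemo {
          public static void main(String[] args) throws Exception {
            JapaneseTokenizer tokenizer =
                new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.NORMAL);
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            PartOfSpeechAttribute pos = tokenizer.addAttribute(PartOfSpeechAttribute.class);
            tokenizer.setReader(new StringReader("東京へ行く"));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
              // prints each token with its part-of-speech tag, tab-separated
              System.out.println(term + "\t" + pos.getPartOfSpeech());
            }
            tokenizer.end();
            tokenizer.close();
          }
        }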

      Attachments

        1. LUCENE-10049.patch
          2 kB
          Uihyun Kim

        Activity

          People

            Assignee: Unassigned
            Reporter: Uihyun Kim
            Votes: 0
            Watchers: 1

          Dates

            Created:
            Updated: