[LUCENE-6837] Add N-best output capability to JapaneseTokenizer - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 5.3
Fix Version/s: 6.0
Component/s: modules/analysis
Labels: None
Lucene Fields: New

Description

Japanese morphological analyzers often produce mis-segmented tokens. N-best output reduces the impact of mis-segmentation on search results. It is more meaningful than character N-grams, and it also increases the hit count.

With N-best output, you can get decompounded tokens (e.g. "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), depending on the dictionary and the N-best parameter settings.
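
To illustrate how overlapping tokens can arise, here is a minimal toy sketch in plain Java. It is not Lucene's implementation and does not use the JapaneseTokenizer API: the dictionary, the word costs, the class name `NBestSketch`, and the `margin` parameter are all hypothetical, chosen only for this example. The sketch enumerates every segmentation that the toy dictionary allows, finds the cheapest one, and then emits the union of the words on all segmentations whose cost is within `margin` of that best cost.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class NBestSketch {
    // Hypothetical toy dictionary: surface form -> word cost (lower is better).
    static final Map<String, Integer> DICT = new LinkedHashMap<>();
    static {
        DICT.put("数学", 3);
        DICT.put("部", 4);
        DICT.put("部長", 3);
        DICT.put("長谷川", 3);
        DICT.put("谷川", 5);
    }

    // One complete segmentation of the input and its total cost.
    static final class Path {
        final List<String> words;
        final int cost;
        Path(List<String> words, int cost) {
            this.words = words;
            this.cost = cost;
        }
    }

    // Enumerate every way to cover text[start..] with dictionary words.
    static void enumerate(String text, int start, List<String> prefix,
                          int cost, List<Path> out) {
        if (start == text.length()) {
            out.add(new Path(new ArrayList<>(prefix), cost));
            return;
        }
        for (Map.Entry<String, Integer> e : DICT.entrySet()) {
            String word = e.getKey();
            if (text.startsWith(word, start)) {
                prefix.add(word);
                enumerate(text, start + word.length(), prefix,
                          cost + e.getValue(), out);
                prefix.remove(prefix.size() - 1); // backtrack
            }
        }
    }

    // Union of the words on every path whose cost is within `margin`
    // of the cheapest path. margin = 0 keeps only the best path(s).
    static TreeSet<String> tokens(String text, int margin) {
        List<Path> paths = new ArrayList<>();
        enumerate(text, 0, new ArrayList<>(), 0, paths);
        int best = Integer.MAX_VALUE;
        for (Path p : paths) {
            best = Math.min(best, p.cost);
        }
        TreeSet<String> result = new TreeSet<>();
        for (Path p : paths) {
            if (p.cost - best <= margin) {
                result.addAll(p.words);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("数学部長谷川", 0)); // best path only
        System.out.println(tokens("数学部長谷川", 1)); // best path plus near-best path
    }
}
```

With these made-up costs, the segmentation 数学 / 部 / 長谷川 costs 10 and 数学 / 部長 / 谷川 costs 11, so a margin of 0 yields three tokens and a margin of 1 yields all five overlapping tokens from the description. A real analyzer also has to account for connection costs and token positions, which this sketch ignores.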

Attachments

LUCENE-6837 for 5.4.zip; 12/Jan/16 01:13; 7.84 MB; Ippei UKAI
LUCENE-6837.patch; 13/Oct/15 05:32; 32 kB; KONNO, Hiroharu
LUCENE-6837.patch; 08/Nov/15 11:15; 42 kB; Christian Moen
LUCENE-6837.patch; 18/Nov/15 10:08; 51 kB; Christian Moen
LUCENE-6837.patch; 20/Nov/15 06:53; 51 kB; KONNO, Hiroharu
LUCENE-6837.patch; 27/Nov/15 10:50; 51 kB; Christian Moen

People

Assignee: Christian Moen

Reporter: KONNO, Hiroharu

Votes: 3

Watchers: 13

Dates

Created: 13/Oct/15 05:29

Updated: 28/Aug/22 14:44

Resolved: 18/Feb/17 00:13