Details
Improvement
Status: Resolved
Minor
Resolution: Fixed
5.3
None
New
Description
Japanese morphological analyzers often produce mis-segmented tokens. N-best output reduces the impact of mis-segmentation on search results: it yields more meaningful tokens than character N-grams, and it increases the hit count too.
With N-best output, you can get decompounded tokens (e.g. "シニアソフトウェアエンジニア" =>
{"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and overlapping tokens (e.g. "数学部長谷川" =>
{"数学", "部", "部長", "長谷川", "谷川"}), depending on the dictionary and N-best parameter settings.
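
To illustrate the idea behind the examples above, here is a minimal toy sketch in Python. It is not the tokenizer's actual implementation. The dictionary, the per-word costs, and the names `segmentations`, `n_best_tokens`, and `TOY_COSTS` are all hypothetical. It also simplifies by summing word costs only (no connection costs between adjacent words) and by taking the n cheapest segmentations, which may differ from how the real N-best parameter is defined.

```python
from typing import Dict, List, Tuple

Token = Tuple[int, int, str]  # (start offset, end offset, surface form)


def segmentations(text: str, costs: Dict[str, int]) -> List[Tuple[int, List[Token]]]:
    """Enumerate every split of `text` into dictionary words, cheapest first."""
    results: List[Tuple[int, List[Token]]] = []

    def walk(pos: int, path: List[Token], cost: int) -> None:
        if pos == len(text):
            results.append((cost, list(path)))
            return
        for end in range(pos + 1, len(text) + 1):
            word = text[pos:end]
            if word in costs:
                path.append((pos, end, word))
                walk(end, path, cost + costs[word])
                path.pop()

    walk(0, [], 0)
    return sorted(results, key=lambda r: r[0])


def n_best_tokens(text: str, costs: Dict[str, int], n: int) -> List[str]:
    """Union of the tokens on the n cheapest segmentations, ordered by position."""
    tokens = set()
    for _, path in segmentations(text, costs)[:n]:
        tokens.update(path)
    return [surface for _, _, surface in sorted(tokens)]


# Hypothetical costs, chosen so that the cheapest path is the mis-segmented one.
TOY_COSTS = {"数学": 10, "部": 12, "部長": 8, "長谷川": 10, "谷川": 9}

print(n_best_tokens("数学部長谷川", TOY_COSTS, n=1))
print(n_best_tokens("数学部長谷川", TOY_COSTS, n=2))
```

With these toy costs, `n=1` yields only the tokens of the cheapest (mis-segmented) path, 数学 / 部長 / 谷川. With `n=2` the second path contributes 部 and 長谷川, so a query for 長谷川 can now match.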