Lucene - Core / LUCENE-9457

Why is Kuromoji tokenization throughput bimodal?

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Lucene Fields: New

    Description

      With the recent accidental regression of Japanese (Kuromoji) tokenization throughput due to exciting FST optimizations, we added new nightly Lucene benchmarks to measure tokenization throughput for JapaneseTokenizer: https://home.apache.org/~mikemccand/lucenebench/analyzers.html

      It has already been running for ~5-6 weeks now!  But for some reason, the results look bimodal: "normally" throughput is ~0.45 M tokens/sec, but for two data points it dropped to ~0.33 M tokens/sec, which is odd.  Maybe it is HotSpot noise?  But it would be good to get to the root cause and fix it if possible.
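
      For context, the inner loop of this kind of benchmark is essentially a tokenize-and-count pass over a fixed corpus. Here is a minimal sketch of such a measurement (the sample text and iteration count are placeholders, not the actual luceneutil setup, and the nightly run uses a real corpus, so absolute numbers from a toy string will not match):

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KuromojiThroughput {
  public static void main(String[] args) throws Exception {
    String text = "日本語のテキストをここに入れてください。"; // placeholder corpus line
    int iters = 200_000;                                  // placeholder iteration count

    // Reuse one tokenizer across "documents", as an analysis chain would:
    // no user dictionary, discard punctuation, SEARCH mode.
    JapaneseTokenizer tok =
        new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
    CharTermAttribute termAtt = tok.addAttribute(CharTermAttribute.class);

    long tokens = 0, chars = 0;
    long startNS = System.nanoTime();
    for (int i = 0; i < iters; i++) {
      tok.setReader(new StringReader(text));
      tok.reset();
      while (tok.incrementToken()) {
        tokens++;
        chars += termAtt.length(); // touch the term, as a real consumer would
      }
      tok.end();
      tok.close(); // close() makes the tokenizer reusable via setReader()
    }
    double sec = (System.nanoTime() - startNS) / 1e9;
    System.out.printf("%.3f M tokens/sec (%d tokens, %d chars)%n",
        tokens / sec / 1e6, tokens, chars);
  }
}
{code}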

      HotSpot noise that randomly steals ~27% of your tokenization throughput is no good!!

      Or does anyone have other ideas about what could be bimodal in Kuromoji?  I don't think this performance test has any randomness in it...
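
      One way to test the HotSpot hypothesis would be to run the same loop under JMH with many forks: if the bimodality comes down to JIT compilation luck, each fork should settle onto one of the two plateaus, and the ~27% gap should reproduce as fork-to-fork variance within a single run. A hedged sketch (the class name and corpus string are made up for illustration):

{code:java}
import java.io.StringReader;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(20) // many forks: compilation luck shows up as fork-to-fork variance
@State(Scope.Benchmark)
public class KuromojiForkBench {
  private static final String TEXT = "日本語のテキストをここに入れてください。"; // placeholder corpus

  private final JapaneseTokenizer tok =
      new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);

  @Benchmark
  public long tokenize() throws Exception {
    tok.setReader(new StringReader(TEXT));
    tok.reset();
    long count = 0;
    while (tok.incrementToken()) {
      count++;
    }
    tok.end();
    tok.close();
    return count; // returning the count keeps the JIT from eliminating the loop
  }
}
{code}

      Running the slow and fast forks with -XX:+PrintCompilation (or -XX:+LogCompilation) and diffing which JapaneseTokenizer/FST methods got compiled, inlined, or deoptimized could then confirm or rule out HotSpot as the culprit.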


          People

            Unassigned Unassigned
            mikemccand Michael McCandless
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue
