Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
With the recent accidental regression of Japanese (Kuromoji) tokenization throughput due to exciting FST optimizations, we added new nightly Lucene benchmarks to measure tokenization throughput for JapaneseTokenizer: https://home.apache.org/~mikemccand/lucenebench/analyzers.html
It has been running for ~5-6 weeks now, but for some reason the results look bi-modal: "normally" throughput is ~0.45 M tokens/sec, but for two data points it dropped to ~0.33 M tokens/sec, which is odd. Maybe it is HotSpot noise? But it would be good to get to the root cause and fix it if possible.
HotSpot noise that randomly steals ~27% of your tokenization throughput is no good!!
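The ~27% figure follows from the two modes visible in the nightly chart; a quick sanity check of that arithmetic (the 0.45 and 0.33 M tokens/sec values are the two modes described above):

```java
public class ThroughputDrop {
    public static void main(String[] args) {
        double normal = 0.45e6; // typical mode: ~0.45 M tokens/sec
        double slow = 0.33e6;   // degraded mode: ~0.33 M tokens/sec
        // Relative drop between the two modes
        double dropPct = 100.0 * (normal - slow) / normal;
        System.out.printf("throughput drop: ~%.0f%%%n", dropPct); // ~27%
    }
}
```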
Or does anyone have any other ideas of what could be bi-modal in Kuromoji? I don't think this performance test has any randomness in it...