Lucene - Core / LUCENE-9457

Why is Kuromoji tokenization throughput bimodal?

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Lucene Fields: New

    Description

      With the recent accidental regression of Japanese (Kuromoji) tokenization throughput due to exciting FST optimizations, we added new nightly Lucene benchmarks to measure tokenization throughput for JapaneseTokenizer: https://home.apache.org/~mikemccand/lucenebench/analyzers.html

      It has already been running for ~5-6 weeks now!  But for some reason, the results look bimodal: "normally" throughput is ~0.45 M tokens/sec, but for two data points it dropped to ~0.33 M tokens/sec, which is odd.  Maybe it is HotSpot noise?  But it would be good to get to the root cause and fix it if possible.
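
      For context, the inner loop of this kind of benchmark is essentially a tokenize-and-count pass over a fixed corpus. Here is a minimal sketch of such a measurement (the sample text and iteration count are placeholders, not the actual luceneutil setup, and the nightly run uses a real corpus, so absolute numbers from a toy string will not match):

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KuromojiThroughput {
  public static void main(String[] args) throws Exception {
    String text = "日本語のテキストをここに入れてください。"; // placeholder corpus line
    int iters = 200_000;                                  // placeholder iteration count

    // Reuse one tokenizer across "documents", as an analysis chain would:
    // no user dictionary, discard punctuation, SEARCH mode.
    JapaneseTokenizer tok =
        new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
    CharTermAttribute termAtt = tok.addAttribute(CharTermAttribute.class);

    long tokens = 0, chars = 0;
    long startNS = System.nanoTime();
    for (int i = 0; i < iters; i++) {
      tok.setReader(new StringReader(text));
      tok.reset();
      while (tok.incrementToken()) {
        tokens++;
        chars += termAtt.length(); // touch the term, as a real consumer would
      }
      tok.end();
      tok.close(); // close() makes the tokenizer reusable via setReader()
    }
    double sec = (System.nanoTime() - startNS) / 1e9;
    System.out.printf("%.3f M tokens/sec (%d tokens, %d chars)%n",
        tokens / sec / 1e6, tokens, chars);
  }
}
{code}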

      HotSpot noise that randomly steals ~27% of your tokenization throughput is no good!!

      Or does anyone have other ideas about what could be bimodal in Kuromoji?  I don't think this performance test has any randomness in it...
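
      One way to test the HotSpot hypothesis would be to run the same loop under JMH with many forks: if the bimodality comes down to JIT compilation luck, each fork should settle onto one of the two plateaus, and the ~27% gap should reproduce as fork-to-fork variance within a single run. A hedged sketch (the class name and corpus string are made up for illustration):

{code:java}
import java.io.StringReader;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(20) // many forks: compilation luck shows up as fork-to-fork variance
@State(Scope.Benchmark)
public class KuromojiForkBench {
  private static final String TEXT = "日本語のテキストをここに入れてください。"; // placeholder corpus

  private final JapaneseTokenizer tok =
      new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);

  @Benchmark
  public long tokenize() throws Exception {
    tok.setReader(new StringReader(TEXT));
    tok.reset();
    long count = 0;
    while (tok.incrementToken()) {
      count++;
    }
    tok.end();
    tok.close();
    return count; // returning the count keeps the JIT from eliminating the loop
  }
}
{code}

      Running the slow and fast forks with -XX:+PrintCompilation (or -XX:+LogCompilation) and diffing which JapaneseTokenizer/FST methods got compiled, inlined, or deoptimized could then confirm or rule out HotSpot as the culprit.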


          People

            Unassigned Unassigned
            mikemccand Michael McCandless
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue
