Patch looks good to me. So the basic idea is that we apply a different penalty depending on
whether the text is kanji or not, rather than a single penalty of 10000 (plus some parameter tuning)?
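To make the question concrete, here is a minimal sketch of that idea, not the code from the patch. The class name, the method names, and both penalty values are hypothetical placeholders chosen for illustration. Treating "kanji" as "every code point is in the Han script" is also an assumption.

```java
public class PenaltySketch {
    // Hypothetical placeholder values; NOT the numbers tuned in the patch.
    static final int KANJI_PENALTY = 3000;
    static final int OTHER_PENALTY = 1700;

    // Assumption: "kanji" means every code point belongs to the Han script.
    static boolean isAllKanji(String text) {
        if (text.isEmpty()) {
            return false;
        }
        return text.codePoints()
                   .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    // Pick the penalty by script instead of using one fixed value for all text.
    static int penaltyFor(String text) {
        return isAllKanji(text) ? KANJI_PENALTY : OTHER_PENALTY;
    }
}
```

With this shape, an all-kanji string gets `KANJI_PENALTY`, while katakana or mixed text falls through to `OTHER_PENALTY`.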
Note that both the tests and the heuristic are tuned for IPADIC. Hence, we need to revisit this when we add support for other dictionaries/models.
I think this is ok for now.
Long term (if there end up being different values for other dictionaries), we can make these conditional on dictionary type:
either at build time (recording these values in the dictionary) or, better, by recording the dictionary type itself and choosing
the values at run time based on that type.
By recording the type, we would also be able to use e.g. assumeTrue(dictionaryType == IPADIC) in unit tests and things like that,
and who knows what else, but let's not worry about it here.
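For whenever this gets picked up, a rough sketch of the run-time variant could look like the following. Everything in it is hypothetical: the `DictionaryType` enum, the `Penalties` holder, the `penaltiesFor` method, and the numeric values are placeholders, and nothing like this exists in the patch.

```java
public class DictionaryTypeSketch {
    // Hypothetical: the type that would be recorded in the dictionary at build time.
    enum DictionaryType { IPADIC, OTHER }

    // Simple holder for the two penalties discussed in this thread.
    static final class Penalties {
        final int kanji;
        final int other;

        Penalties(int kanji, int other) {
            this.kanji = kanji;
            this.other = other;
        }
    }

    // Select the penalties at run time based on the recorded dictionary type.
    static Penalties penaltiesFor(DictionaryType type) {
        switch (type) {
            case IPADIC:
                // Placeholder values, not the ones tuned in the patch.
                return new Penalties(3000, 1700);
            default:
                // Fail loudly rather than silently reusing IPADIC tuning.
                throw new IllegalArgumentException("No tuned penalties for " + type);
        }
    }
}
```

Tests that depend on IPADIC tuning could then be guarded with JUnit's `Assume.assumeTrue(dictionaryType == DictionaryType.IPADIC)`, as suggested above, so they are skipped rather than failed against other dictionaries.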