Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
1.8.4
-
None
Description
FeatureGeneratorUtil.tokenFeature() always classifies Japanese words as "lc" (lower case). This looks like a bug to me because they are not lower-case letters. More generally, FeatureGeneratorUtil.tokenFeature() seems to handle only European/American languages.
For example, in Japanese NER, typical token classes are as follows:
- DIGIT
- HIRA : あ, い, う, え, お etc.
- KATA : ア, イ, ウ, エ, オ etc.
- ALPHA : we don't need to distinguish lower/upper case
- OTHER
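A minimal sketch of how the classes above could be computed, assuming a token gets a class only when all of its characters agree and falls back to OTHER otherwise. The class name JapaneseTokenClass and the method classify() are hypothetical and not part of OpenNLP; the Unicode-block checks use only the standard java.lang.Character API.

```java
import java.lang.Character.UnicodeBlock;

/** Hypothetical helper (not an OpenNLP class): maps a token to one of the classes listed above. */
final class JapaneseTokenClass {

    private JapaneseTokenClass() {}

    /** Returns DIGIT, HIRA, KATA, ALPHA or OTHER; mixed or empty tokens are OTHER (an assumption). */
    static String classify(String token) {
        if (token.isEmpty()) {
            return "OTHER";
        }
        String first = null;
        // Iterate by code point so supplementary characters are handled correctly.
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            i += Character.charCount(cp);
            String cls = classOf(cp);
            if (first == null) {
                first = cls;
            } else if (!first.equals(cls)) {
                return "OTHER"; // characters of different classes in one token
            }
        }
        return first;
    }

    private static String classOf(int cp) {
        if (Character.isDigit(cp)) {
            return "DIGIT"; // also covers full-width digits
        }
        UnicodeBlock block = UnicodeBlock.of(cp);
        if (block == UnicodeBlock.HIRAGANA) {
            return "HIRA";
        }
        if (block == UnicodeBlock.KATAKANA) {
            return "KATA"; // the block includes the prolonged sound mark
        }
        // ASCII letters only; lower and upper case are deliberately not distinguished.
        if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) {
            return "ALPHA";
        }
        return "OTHER";
    }
}
```

With this sketch, JapaneseTokenClass.classify("コーヒー") would yield KATA, while a kanji token such as "東京" would fall into OTHER.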
I think we could extend FeatureGeneratorUtil.tokenFeature() with the additional token classes mentioned above, but later on someone working with another Asian language may make a similar request.
I'd like to make FeatureGeneratorUtil pluggable, but I don't have a concrete idea yet.
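One possible direction, sketched under the assumption that the token-class logic is passed in as a strategy object instead of being hard-coded. The names TokenClassifier, TokenClassFeatures and the "tc=" feature prefix are hypothetical illustrations, not existing OpenNLP API and not necessarily how this issue was resolved.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plug-in point: each language supplies its own token-class logic. */
interface TokenClassifier {
    String tokenClass(String token);
}

/** Hypothetical feature generator that delegates classification to the plugged-in strategy. */
final class TokenClassFeatures {

    private final TokenClassifier classifier;

    TokenClassFeatures(TokenClassifier classifier) {
        this.classifier = classifier;
    }

    /** Returns the token-class feature for the token at the given index. */
    List<String> featuresFor(String[] tokens, int index) {
        List<String> features = new ArrayList<>();
        features.add("tc=" + classifier.tokenClass(tokens[index]));
        return features;
    }
}
```

A caller could then pass a Japanese-specific classifier for Japanese models and a different one for other languages, without changing the generator itself.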