  1. OpenNLP
  2. OPENNLP-1197

FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.4
    • Fix Version/s: 1.9.0
    • Component/s: Machine Learning
    • Labels:
      None

      Description

      FeatureGeneratorUtil.tokenFeature() always classifies Japanese words as "lc" (lower case). This looks like a bug to me, because they are not lower-case letters. More generally, it seems that FeatureGeneratorUtil.tokenFeature() only takes European/American languages into account.
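      One way a result like this can arise is sketched below. It is an illustration only and is not taken from the OpenNLP source: the class LetterCaseCheck and both methods are hypothetical. It shows that Java reports hiragana as letters that are neither upper case nor lower case, so a check that treats "no upper-case letter" as "lower case" will label Japanese words "lc", while a strict Character.isLowerCase() check will not.

```java
public class LetterCaseCheck {

    // Hypothetical check (NOT OpenNLP's code): a token counts as "lower case"
    // when every code point is a letter and none of them is upper case.
    static boolean looksLowerCase(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (!Character.isLetter(cp) || Character.isUpperCase(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // Stricter variant: every code point must itself be a lower-case letter.
    static boolean isStrictlyLowerCase(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (!Character.isLowerCase(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String hiragana = "\u3042\u3044"; // the hiragana letters A and I
        System.out.println(looksLowerCase(hiragana));      // prints "true"
        System.out.println(isStrictlyLowerCase(hiragana)); // prints "false"
        System.out.println(isStrictlyLowerCase("abc"));    // prints "true"
    }
}
```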

      For example, in a Japanese NER problem, typical token classes are as follows:

      • DIGIT
      • HIRA : あ, い, う, え, お etc.
      • KATA : ア, イ, ウ, エ, オ etc.
      • ALPHA : we don't need to distinguish lower/upper case
      • OTHER
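      The token classes listed above could be computed roughly as follows. This is a minimal sketch under stated assumptions, not a proposed patch: the class JapaneseTokenClass and its methods are hypothetical, ALPHA is limited to ASCII letters, a token gets a class only when all of its characters share that class, and everything else (including kanji and mixed tokens) falls into OTHER.

```java
public class JapaneseTokenClass {

    // Classify a single code point into one of the five classes.
    // Assumption: ALPHA covers ASCII letters only, with no case distinction.
    static String classOf(int cp) {
        if (Character.isDigit(cp)) {
            return "DIGIT";
        }
        Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
        if (block == Character.UnicodeBlock.HIRAGANA) {
            return "HIRA";
        }
        if (block == Character.UnicodeBlock.KATAKANA) {
            return "KATA";
        }
        if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) {
            return "ALPHA";
        }
        return "OTHER";
    }

    // Classify a whole token: the shared class of all its code points,
    // or OTHER when the token is empty or mixes classes.
    static String classify(String token) {
        if (token.isEmpty()) {
            return "OTHER";
        }
        String result = null;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            String current = classOf(cp);
            if (result == null) {
                result = current;
            } else if (!result.equals(current)) {
                return "OTHER";
            }
            i += Character.charCount(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(classify("\u3042\u3044")); // hiragana, prints "HIRA"
        System.out.println(classify("\u30A2\u30A4")); // katakana, prints "KATA"
        System.out.println(classify("2018"));         // prints "DIGIT"
        System.out.println(classify("OpenNLP"));      // prints "ALPHA"
        System.out.println(classify("\u6F22\u5B57")); // kanji, prints "OTHER"
    }
}
```

      Working on code points rather than chars keeps the loop correct for characters outside the Basic Multilingual Plane.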

      I think we could extend FeatureGeneratorUtil.tokenFeature() with the additional token classes mentioned above, but later on someone working on another Asian language may request something similar.

      I'd like to make FeatureGeneratorUtil pluggable, but I don't have a concrete idea yet.

            People

            • Assignee: Koji Sekiguchi
            • Reporter: Koji Sekiguchi
            • Votes: 0
            • Watchers: 2