  OpenNLP / OPENNLP-1221

FeatureGeneratorUtil.tokenFeature() is too specific for some languages


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.9.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      As I described in OPENNLP-1197, for the Japanese NER task we usually use only DIGIT, HIRA (あ, い, う, え, お, etc.), KATA (ア, イ, ウ, エ, オ, etc.), ALPHA and OTHER as token classes. The token classes that FeatureGeneratorUtil.tokenFeature() currently generates are too specific; I don't need to distinguish among lc (lowercase letters), ac (all capital letters) and ic (initial capital letter), for example.
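
      For example, the following small program (just an illustration I used to check the behavior; the classes mentioned in the comments are what the current implementation is expected to return) prints the class that tokenFeature() assigns to a few tokens:

      import opennlp.tools.util.featuregen.FeatureGeneratorUtil;

      public class TokenFeatureDemo {
        public static void main(String[] args) {
          // The current implementation maps English tokens to fine-grained
          // classes such as "lc", "ac", "ic", "2d" or "4d", distinctions that
          // add little for Japanese NER.
          String[] tokens = {"tokyo", "NLP", "Tokyo", "42", "2018", "とうきょう", "トウキョウ", "東京"};
          for (String token : tokens) {
            System.out.println(token + " -> " + FeatureGeneratorUtil.tokenFeature(token));
          }
        }
      }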

      As a trial, I applied the following patch to avoid the overly specific token class generation:

      diff --git a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
      index e6b8af95..405938d1 100644
      --- a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
      +++ b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
      @@ -29,6 +29,8 @@ public class FeatureGeneratorUtil {
         private static final String TOKEN_AND_CLASS_PREFIX = "w&c";
       
         private static final Pattern capPeriod = Pattern.compile("^[A-Z]\\.$");
      +  private static final Pattern pDigit = Pattern.compile("^\\p{IsDigit}+$");
      +  private static final Pattern pAlpha = Pattern.compile("^\\p{IsAlphabetic}+$");
       
         /**
          * Generates a class name for the specified token.
      @@ -64,48 +66,11 @@ public class FeatureGeneratorUtil {
           else if (pattern.isAllKatakana()) {
             feat = "jak";
           }
      -    else if (pattern.isAllLowerCaseLetter()) {
      -      feat = "lc";
      +    else if (pDigit.matcher(token).find()) {
      +      feat = "digit";
           }
      -    else if (pattern.digits() == 2) {
      -      feat = "2d";
      -    }
      -    else if (pattern.digits() == 4) {
      -      feat = "4d";
      -    }
      -    else if (pattern.containsDigit()) {
      -      if (pattern.containsLetters()) {
      -        feat = "an";
      -      }
      -      else if (pattern.containsHyphen()) {
      -        feat = "dd";
      -      }
      -      else if (pattern.containsSlash()) {
      -        feat = "ds";
      -      }
      -      else if (pattern.containsComma()) {
      -        feat = "dc";
      -      }
      -      else if (pattern.containsPeriod()) {
      -        feat = "dp";
      -      }
      -      else {
      -        feat = "num";
      -      }
      -    }
      -    else if (pattern.isAllCapitalLetter()) {
      -      if (token.length() == 1) {
      -        feat = "sc";
      -      }
      -      else {
      -        feat = "ac";
      -      }
      -    }
      -    else if (capPeriod.matcher(token).find()) {
      -      feat = "cp";
      -    }
      -    else if (pattern.isInitialCapitalLetter()) {
      -      feat = "ic";
      +    else if (pAlpha.matcher(token).find()) {
      +      feat = "alpha";
           }
           else {
             feat = "other";
      

      With this patch, the total F1 increased from 82.00% to 82.13%. The gain may look trivial, but I think there is still plenty of room to tune and improve performance.
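
      For reference, the two new patterns behave roughly as follows (a quick standalone check, not part of the patch). Note that \p{IsAlphabetic} also matches kana and kanji, which is fine here because all-hiragana and all-katakana tokens are already caught by the jah/jak branches above:

      import java.util.regex.Pattern;

      public class PatternCheck {
        public static void main(String[] args) {
          Pattern pDigit = Pattern.compile("^\\p{IsDigit}+$");
          Pattern pAlpha = Pattern.compile("^\\p{IsAlphabetic}+$");

          // \p{IsDigit} covers ASCII digits as well as full-width digits.
          System.out.println(pDigit.matcher("2018").find());     // expected: true
          System.out.println(pDigit.matcher("２０１８").find());  // expected: true

          // \p{IsAlphabetic} covers Latin letters as well as kana and kanji.
          System.out.println(pAlpha.matcher("Tokyo").find());    // expected: true
          System.out.println(pAlpha.matcher("東京").find());      // expected: true
        }
      }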

      Fortunately, I was able to add the japanese-addon project to opennlp-addons in the previous ticket, so I'd like to add some programs that generate simpler token classes to japanese-addon.
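
      A rough sketch of the kind of generator I have in mind (the class and method names below are only placeholders, not existing code):

      import java.util.regex.Pattern;

      // Sketch for japanese-addon: collapse token classes to digit, hira,
      // kata, alpha and other only.
      public class SimpleTokenClassGenerator {

        private static final Pattern DIGIT = Pattern.compile("^\\p{IsDigit}+$");
        private static final Pattern HIRA = Pattern.compile("^\\p{IsHiragana}+$");
        private static final Pattern KATA = Pattern.compile("^\\p{IsKatakana}+$");
        private static final Pattern ALPHA = Pattern.compile("^\\p{IsAlphabetic}+$");

        public static String tokenClass(String token) {
          if (DIGIT.matcher(token).matches()) {
            return "digit";
          }
          // Kana must be tested before ALPHA because \p{IsAlphabetic}
          // also matches hiragana and katakana.
          if (HIRA.matcher(token).matches()) {
            return "hira";
          }
          if (KATA.matcher(token).matches()) {
            return "kata";
          }
          if (ALPHA.matcher(token).matches()) {
            return "alpha";
          }
          return "other";
        }
      }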


            People

            • Assignee: Koji Sekiguchi
            • Reporter: Koji Sekiguchi
            • Votes: 0
            • Watchers: 2
