  1. OpenNLP
  2. OPENNLP-1218

All binary implementations of AbstractModelWriter and DataReader throw java.io.UTFDataFormatException when a large dataset is used for training


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.8.4
    • Fix Version/s: 2.1.0
    • Component/s: None
    • Labels: None

    Description

      All binary implementations of AbstractModelWriter and DataReader throw "java.io.UTFDataFormatException: encoded string too long" from the java.io.DataOutputStream.writeUTF call when a large dataset is used for training and a single string written to the model exceeds 64 KB once encoded. This is a known limitation of java.io.DataOutputStream.writeUTF, which stores the encoded length in two bytes and therefore cannot write more than 65535 bytes per string.

      The stack trace is as follows:

      java.io.UTFDataFormatException: encoded string too long: 97519 bytes

      at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
      at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
      at opennlp.tools.ml.naivebayes.BinaryNaiveBayesModelWriter.writeUTF(BinaryNaiveBayesModelWriter.java:67)
      at opennlp.tools.ml.naivebayes.NaiveBayesModelWriter.persist(NaiveBayesModelWriter.java:169)
      at opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75)
      at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71)
      at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36)
      at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29)
      at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597)
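
      The limit can be reproduced outside OpenNLP with a few lines of plain JDK code. The sketch below is only an illustration: the class name WriteUtfLimitDemo and the helper writeUtfAccepts are hypothetical and not part of OpenNLP. It writes ASCII strings of increasing length, so each character encodes to exactly one byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

// Hypothetical demo class, not part of OpenNLP.
final class WriteUtfLimitDemo {

  // Returns true if DataOutputStream.writeUTF accepts a string of n ASCII characters.
  static boolean writeUtfAccepts(int n) throws IOException {
    char[] chars = new char[n];
    Arrays.fill(chars, 'a'); // 'a' encodes to one byte in modified UTF-8
    String s = new String(chars);
    try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
      out.writeUTF(s);
      return true;
    } catch (UTFDataFormatException e) {
      // Thrown when the encoded form is longer than 65535 bytes.
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(writeUtfAccepts(65535)); // prints "true"
    System.out.println(writeUtfAccepts(65536)); // prints "false"
    System.out.println(writeUtfAccepts(97519)); // prints "false" (size from the stack trace)
  }
}
```

      Strings with non-ASCII characters hit the limit earlier, because those characters take two or three bytes each in the encoded form.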

       

      The implementation should write each string as a length-prefixed byte array to avoid this limit.

      The following change resolves the issue:

       

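      public void writeUTF(String s) throws java.io.IOException {
        // Encode as UTF-8 and prefix with a 4-byte length instead of writeUTF's 2-byte length.
        byte[] ctxByte = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        output.writeInt(ctxByte.length);
        output.write(ctxByte);
        // output.writeUTF(s);
      }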
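
      A change like this on the writer side only works if the reader side decodes the same layout. The following is a minimal round-trip sketch under that assumption; the class LengthPrefixedStringDemo and its methods writeString, readString and roundTrip are hypothetical names used for illustration, not the OpenNLP API or the fix that was eventually applied.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of OpenNLP.
final class LengthPrefixedStringDemo {

  // Writer side: 4-byte length followed by the UTF-8 bytes.
  static void writeString(DataOutputStream out, String s) throws IOException {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  // Reader side: mirrors the writer exactly.
  static String readString(DataInputStream in) throws IOException {
    int length = in.readInt();
    if (length < 0) {
      throw new IOException("negative string length: " + length);
    }
    byte[] bytes = new byte[length];
    in.readFully(bytes); // blocks until all bytes are read or throws EOFException
    return new String(bytes, StandardCharsets.UTF_8);
  }

  // Writes s to an in-memory buffer and reads it back.
  static String roundTrip(String s) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      writeString(out, s);
    }
    try (DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
      return readString(in);
    }
  }

  public static void main(String[] args) throws IOException {
    char[] chars = new char[97519]; // size reported in the stack trace
    Arrays.fill(chars, 'a');
    String large = new String(chars);
    System.out.println(large.equals(roundTrip(large))); // prints "true"
  }
}
```

      Note that this layout is not compatible with data written by writeUTF: the length prefix is four bytes instead of two, and the payload is standard UTF-8 rather than the modified UTF-8 that writeUTF produces. Models serialized with the old format would therefore need a matching reader or a format version check.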

       

            People

              Assignee: mawiesne Martin Wiesner
              Reporter: sudheerprem Sudheer Prem
              Votes: 0
              Watchers: 2
