[OPENNLP-1218] All binary implementation of AbstractModelWriter and DataReader throws java.io.UTFDataFormatException when large dataset is used for training - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 1.8.4
Fix Version/s: 2.1.0
Component/s: None
Labels: None

Description

All binary implementations of AbstractModelWriter and DataReader throw "java.io.UTFDataFormatException: encoded string too long" in the java.io.DataOutputStream.writeUTF call when a large dataset is used for training and a string passed to writeUTF encodes to more than 64 KB. This is a known limitation of java.io.DataOutputStream.writeUTF: it stores the encoded length in a two-byte prefix, so it cannot write a string whose encoded form exceeds 65535 bytes.

The stack trace is as follows:

java.io.UTFDataFormatException: encoded string too long: 97519 bytes

at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
at opennlp.tools.ml.naivebayes.BinaryNaiveBayesModelWriter.writeUTF(BinaryNaiveBayesModelWriter.java:67)
at opennlp.tools.ml.naivebayes.NaiveBayesModelWriter.persist(NaiveBayesModelWriter.java:169)
at opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75)
at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71)
at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36)
at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29)
at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597)
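
The limit can be reproduced outside of OpenNLP. The following is a minimal sketch, not code from the project: the class name WriteUtfLimitDemo and the helper writeUtfAccepts are made up for illustration, and the test strings are plain ASCII so that each character encodes to exactly one byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.util.Arrays;

// Hypothetical demo class; not part of OpenNLP.
public class WriteUtfLimitDemo {

    /** Returns true if DataOutputStream.writeUTF accepts a string of n ASCII characters. */
    static boolean writeUtfAccepts(int n) throws IOException {
        char[] chars = new char[n];
        Arrays.fill(chars, 'a'); // one byte per character in modified UTF-8
        String s = new String(chars);
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            // Thrown when the encoded form does not fit the two-byte length prefix.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("65535 bytes accepted: " + writeUtfAccepts(65535));
        System.out.println("65536 bytes accepted: " + writeUtfAccepts(65536));
        System.out.println("97519 bytes accepted: " + writeUtfAccepts(97519));
    }
}
```

The 97519-byte case matches the size reported in the stack trace above.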

The implementation should write the string as a length-prefixed byte array instead of calling writeUTF.

The following change resolves the issue:

public void writeUTF(String s) throws java.io.IOException {
  byte[] ctxByte = s.getBytes("utf-8");
  output.writeInt(ctxByte.length);
  output.write(ctxByte);
  // output.writeUTF(s);
}
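
A writer changed this way needs a matching change on the reading side, because DataInputStream.readUTF expects the two-byte length prefix that writeUTF produces. The sketch below shows one possible write/read pair under that assumption. The class LengthPrefixedStrings and its methods are hypothetical names for illustration and are not taken from OpenNLP or from the fix applied in OPENNLP-1366.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of OpenNLP.
public class LengthPrefixedStrings {

    /** Writes a 4-byte length followed by the UTF-8 bytes of s. */
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    /** Reads a string previously written by writeString. */
    static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("Negative string length: " + length);
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes); // blocks until all bytes are read, or throws EOFException
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Serializes s to memory and reads it back; used here only to exercise the pair. */
    static String roundTrip(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeString(out, s);
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()))) {
            return readString(in);
        }
    }
}
```

Note that a length-prefixed layout like this is not byte-compatible with data written by writeUTF, so models persisted in the old format would presumably need separate handling when read.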

Issue Links

duplicates

OPENNLP-1366 Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException (Closed)

People

Assignee: Martin Wiesner
Reporter: Sudheer Prem
Votes: 0
Watchers: 2

Dates

Created: 30/Aug/18 04:40
Updated: 09/Dec/22 11:56
Resolved: 09/Dec/22 10:45