OPENNLP-1366

Training of MaxEnt Model with large corpora fails with java.io.UTFDataFormatException


Details

    Description

      As written on [dev@opennlp.a.o|https://lists.apache.org/thread/vc5lfzj81tco703noqxpvy8sfj8fw8b1], we are training a large OpenNLP MaxEnt model for lemmatizing
      German texts. We use a Wikipedia treebank from Tübingen.

      Training consumes > 2 TB of RAM but does finish after some time. Writing the trained model, however, fails with a java.io.UTFDataFormatException.

      Training such a big model isn't feasible for debugging. Fortunately, a similar case with a smaller dataset is described on Stack Overflow: https://stackoverflow.com/questions/70064477/opennlp-lemmatizertrainer-utfdataformatexception-encoded-string-too-long

      It contains the OpenNLP CLI command to train a lemmatizer on a much smaller dataset.

      The exception is raised while writing a String as UTF in DataOutputStream, which limits the encoded string to 65535 bytes (a hard-coded limit in the JDK, for reasons beyond my knowledge).
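The limit can be reproduced without OpenNLP or any training data. The following is a minimal sketch, assuming Java 11 or later (for `String.repeat`); the class name `WriteUtfLimitDemo` and the helper `canWriteUtf` are made up for illustration and are not part of OpenNLP.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;

public class WriteUtfLimitDemo {

    // Hypothetical helper: returns true if DataOutputStream.writeUTF accepts
    // a string of n ASCII characters, false if it rejects it as too long.
    static boolean canWriteUtf(int n) {
        // Each 'a' encodes to exactly one byte in modified UTF-8,
        // so the encoded length equals n.
        String s = "a".repeat(n);
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            // Thrown when the encoded form exceeds 65535 bytes.
            return false;
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canWriteUtf(65535));  // prints "true"
        System.out.println(canWriteUtf(65536));  // prints "false"
        System.out.println(canWriteUtf(383769)); // prints "false" (size from the stack trace)
    }
}
```

The boundary at 65535 bytes suggests that any single string of that encoded size or larger in the model will trigger the failure, independent of the overall model size.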

      Stacktrace:

      java.io.UTFDataFormatException: encoded string too long: 383769 bytes
              at java.base/java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
              at java.base/java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
              at opennlp.tools.ml.maxent.io.BinaryGISModelWriter.writeUTF(BinaryGISModelWriter.java:71)
              at opennlp.tools.ml.maxent.io.GISModelWriter.persist(GISModelWriter.java:97)
              at opennlp.tools.ml.model.GenericModelWriter.persist(GenericModelWriter.java:75)
              at opennlp.tools.util.model.ModelUtil.writeModel(ModelUtil.java:71)
              at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:36)
              at opennlp.tools.util.model.GenericModelSerializer.serialize(GenericModelSerializer.java:29)
              at opennlp.tools.util.model.BaseModel.serialize(BaseModel.java:597)
              at opennlp.tools.cmdline.CmdLineUtil.writeModel(CmdLineUtil.java:182)
              at opennlp.tools.cmdline.lemmatizer.LemmatizerTrainerTool.run(LemmatizerTrainerTool.java:77)
              at opennlp.tools.cmdline.CLI.main(CLI.java:256) 
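One conceivable way around the limit is to write long strings as a 4-byte length followed by the raw UTF-8 bytes instead of calling `writeUTF`. The sketch below only illustrates that idea: the class `LongStringIo` and its methods are hypothetical, this is not necessarily how OpenNLP addresses the issue, and the resulting byte layout is not compatible with data read via `DataInputStream.readUTF`, so existing model files would need separate handling. It assumes Java 11 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class LongStringIo {

    // Hypothetical replacement for writeUTF: an int length prefix allows
    // up to Integer.MAX_VALUE bytes instead of 65535.
    static void writeLongString(DataOutputStream out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Counterpart reader for the layout produced by writeLongString.
    static String readLongString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Writes a string to an in-memory buffer and reads it back.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                writeLongString(out, s);
            }
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return readLongString(in);
            }
        } catch (IOException e) {
            // Not expected for in-memory streams.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 200000 two-byte characters: 400000 bytes encoded, far above the writeUTF limit.
        String big = "ä".repeat(200_000);
        System.out.println(roundTrip(big).equals(big));
    }
}
```

Note that this uses standard UTF-8 rather than the modified UTF-8 that `writeUTF` produces, which is another reason the two formats cannot be mixed.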

            People

              rzo1 Richard Zowalla