  1. OpenNLP
  2. OPENNLP-830

Huge runtime improvement on training (POS, Chunk, ...)


    Details

    • Flags:
      Patch, Important

      Description

      opennlp.tools.ml.model.IndexHashTable is a custom-made hash table that stores an index mapping. It is heavily used throughout opennlp.tools.ml.* (i.e. by every model) and leads to disastrous performance.

      This hash table is probably legacy code and is highly inefficient. A simple drop-in replacement by a java.util.HashMap wrapper solves the issue, does not break compatibility, and does not add any dependency.

      Training a POS tagger on a large dataset with custom tags, I see an improvement by a factor of 5. The change also seems to speed up the training pipeline of all ML models.

      For a quick fix, see: https://github.com/jsubercaze/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/ml/model/IndexHashTable.java
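
To illustrate the idea of a java.util.HashMap wrapper, here is a minimal sketch. It is not the code from the linked patch: the class name IndexMapSketch, its constructor, and its get/size methods are hypothetical, and it assumes the table only has to map each key to its position in an array, returning -1 for unknown keys.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for a key-to-index table, backed by java.util.HashMap.
 * The name and method signatures are assumptions made for illustration only.
 */
final class IndexMapSketch<T> {

    private final Map<T, Integer> indexByKey;

    /** Maps each key to its position in the given array. */
    IndexMapSketch(T[] keys) {
        // Presize so that filling the map does not trigger a rehash
        // at HashMap's default load factor of 0.75.
        indexByKey = new HashMap<>((int) (keys.length / 0.75f) + 1);
        for (int i = 0; i < keys.length; i++) {
            // put() returns the previous value, so non-null means a duplicate key.
            if (indexByKey.put(keys[i], i) != null) {
                throw new IllegalArgumentException("Duplicate key: " + keys[i]);
            }
        }
    }

    /** Returns the index of the key, or -1 if the key is unknown. */
    int get(T key) {
        Integer index = indexByKey.get(key);
        return index != null ? index : -1;
    }

    /** Returns the number of keys stored. */
    int size() {
        return indexByKey.size();
    }
}
```

A wrapper of this kind keeps callers unchanged as long as it exposes the same methods as the class it replaces, and it relies only on the JDK, which matches the "no added dependency" point above.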

        Attachments

          Activity

            People

            • Assignee:
              joern Jörn Kottmann
              Reporter:
              subercaze.julien@gmail.com Julien Subercaze
            • Votes:
              0
              Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                1h
                Remaining:
                1h
                Logged:
                Not Specified