  1. OpenNLP
  2. OPENNLP-830

Huge runtime improvement on training (POS, Chunk, ...)

Details

    • Patch, Important

    Description

      opennlp.tools.ml.model.IndexHashTable is a custom-made hash table used to store index mappings. It is heavily used throughout opennlp.tools.ml.* (i.e. by every model) and leads to disastrous performance.

      This hash table is probably legacy code and is highly inefficient. A simple drop-in replacement by a java.util.HashMap wrapper solves the issue, doesn't break compatibility, and does not add any dependency.
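
      To illustrate the idea, here is a minimal sketch of what such a HashMap-backed wrapper could look like. It is not the code from the patch. The class name HashMapIndexTable is hypothetical, and the interface shown (a constructor taking an array of keys, get returning -1 for an unknown key, and size) is an assumption about what callers need, not a statement of the existing class's API.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: maps each key to its position in the input array,
 * backed by java.util.HashMap instead of a hand-rolled hash table.
 */
final class HashMapIndexTable<T> {

  private final Map<T, Integer> indexByKey;

  HashMapIndexTable(T[] keys) {
    // Presize the map so that inserting keys.length entries does not
    // trigger a rehash at HashMap's default load factor of 0.75.
    indexByKey = new HashMap<>((int) (keys.length / 0.75f) + 1);
    for (int i = 0; i < keys.length; i++) {
      // put returns the previous value, which is non-null only for a duplicate.
      if (indexByKey.put(keys[i], i) != null) {
        throw new IllegalArgumentException("Duplicate key: " + keys[i]);
      }
    }
  }

  /** Returns the index of the key, or -1 if it is absent (assumed convention). */
  int get(T key) {
    Integer index = indexByKey.get(key);
    return index == null ? -1 : index;
  }

  /** Returns the number of distinct keys stored. */
  int size() {
    return indexByKey.size();
  }

  public static void main(String[] args) {
    HashMapIndexTable<String> table =
        new HashMapIndexTable<>(new String[] {"NN", "VB", "DT"});
    System.out.println(table.get("VB")); // index of a known key
    System.out.println(table.get("JJ")); // unknown key
  }
}
```

      Keeping the lookup behind a small wrapper means callers do not depend on HashMap directly, which is how a replacement of this kind can stay source-compatible.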

      Training a POS tagger on a large dataset with custom tags, I see a factor-of-5 runtime improvement. The change also seems to speed up the training pipeline of all ML models.

      For a quick fix, see: https://github.com/jsubercaze/opennlp/blob/trunk/opennlp-tools/src/main/java/opennlp/tools/ml/model/IndexHashTable.java

          People

            joern Jörn Kottmann
            subercaze.julien@gmail.com Julien Subercaze
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: 1h
                Remaining: 1h
                Logged: Not Specified