  1. OpenNLP
  2. OPENNLP-1556

Improve speed of checksum computation in TwoPassDataIndexer

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.9.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.3.1, 2.3.2, 2.3.3
    • Fix Version/s: 2.4.0
    • Component/s: Machine Learning
    • Labels: None

    Description

      For training ML models, all observations (Events) are indexed via 
      TwoPassDataIndexer#index(ObjectStream<Event> eventStream).

      When #index(..) runs, a temporary file is written and then read back in. Instances of HashSumEventStream are used to validate the processed content via a checksum.

      The checksum is a cryptographic (MD5) message digest, computed over the output of a rather slow toString() implementation in Event. This is much slower than simply computing a checksum (such as a CRC32c value) for both directions (write/read). The slowdown is more problematic when larger training corpora are (pre-)processed, that is, indexed in advance.
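
      To make the difference concrete, here is a minimal sketch that computes both values over the same bytes using only JDK classes (java.security.MessageDigest and java.util.zip.CRC32C, the latter available since Java 9). The class name DigestVsChecksum and the sample string are made up for illustration; this is not code from OpenNLP.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32C;

// Hypothetical demo class, not part of OpenNLP.
public class DigestVsChecksum {

  // Cryptographic message digest: yields a 16-byte (128-bit) MD5 value.
  static byte[] md5(byte[] data) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      return md.digest(data);
    } catch (NoSuchAlgorithmException e) {
      // MD5 is a required algorithm on every Java platform implementation.
      throw new IllegalStateException(e);
    }
  }

  // Plain checksum: yields a 32-bit CRC32C value, returned as a long.
  static long crc32c(byte[] data) {
    CRC32C crc = new CRC32C();
    crc.update(data, 0, data.length);
    return crc.getValue();
  }

  public static void main(String[] args) {
    // Assumed sample payload standing in for the bytes of one event.
    byte[] data = "outcome [feature1 feature2]".getBytes(StandardCharsets.UTF_8);
    System.out.println("MD5 length in bytes: " + md5(data).length);
    System.out.printf("CRC32C value: %08x%n", crc32c(data));
  }
}
```

      Both calls detect accidental changes to the data; only the MD5 variant pays for properties that an integrity check on a locally written temporary file does not need.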

      Aims:

      • Speed up the (IO-bound) indexing part prior to the actual CPU-bound training phase.
      • Switch from MD5 to CRC32c, as there is no need for a cryptographic hash function here; it's simply a checksum that is required to decide whether all bytes written are the same bytes that are read.
      • Remove the untested class HashSumEventStream, which is just a wrapper that calls a slow toString() in Event to get some bytes for computing a checksum / message digest.
      • Provide a replacement for HashSumEventStream, e.g. ChecksumEventStream that makes use of the faster CRC32c checksum computation, avoiding cryptographic hash functions such as MD5.
      • Make sure all existing tests hold.
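
      The write/read comparison described in the aims can be sketched with the JDK's CheckedOutputStream and CheckedInputStream. This is an illustration under assumptions only: the class RoundTripChecksum, its methods, and the use of plain strings as stand-ins for events are hypothetical, and it does not show how TwoPassDataIndexer or the proposed ChecksumEventStream is actually implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.CRC32C;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;

// Hypothetical demo class, not part of OpenNLP.
public class RoundTripChecksum {

  // Writes one line per event and returns the CRC32C of all bytes written.
  static long writeEvents(Path file, List<String> events) throws IOException {
    try (CheckedOutputStream out =
             new CheckedOutputStream(Files.newOutputStream(file), new CRC32C())) {
      for (String event : events) {
        out.write((event + "\n").getBytes(StandardCharsets.UTF_8));
      }
      out.flush();
      return out.getChecksum().getValue();
    }
  }

  // Reads the file back and returns the CRC32C of all bytes read.
  static long readEvents(Path file) throws IOException {
    try (CheckedInputStream in =
             new CheckedInputStream(Files.newInputStream(file), new CRC32C())) {
      byte[] buffer = new byte[8192];
      while (in.read(buffer) != -1) {
        // The checksum is updated as a side effect of read().
      }
      return in.getChecksum().getValue();
    }
  }

  // True if the bytes read back have the same checksum as the bytes written.
  static boolean roundTripMatches(List<String> events) {
    try {
      Path tmp = Files.createTempFile("events", ".tmp");
      try {
        return writeEvents(tmp, events) == readEvents(tmp);
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    List<String> events = List.of("outcomeA f1 f2", "outcomeB f3");
    System.out.println("Checksums match: " + roundTripMatches(events));
  }
}
```

      Because the checksum is updated on the raw bytes as they pass through the stream, no per-event toString() call is needed in this sketch.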

          People

            Assignee: mawiesne Martin Wiesner
            Reporter: mawiesne Martin Wiesner