  1. OpenNLP
  2. OPENNLP-1505

Reduce object creation in NGramCharModel and StringUtil

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: Language Detector
    • Labels: None

    Description

      During a profiling session, I noticed that many tests in opennlp.tools.langdetect take quite some time to execute. Digging deeper into those tests, it quickly became obvious that StringUtil#toLowerCase() creates a new String on every call (see NGramCharModel#add(...), lines 99 to 108).

      Because it is called quite frequently in NGramCharModel, this results in the creation of millions of String objects while building n-grams for a given input.

      Aims:

      • Reduce object creation, in particular the creation of millions of String objects
      • Improve runtime of the langdetect tests (and potentially others)

      Idea:

      • Use (Heap)CharBuffer instead of String so that underlying char arrays can be re-used, instead of copying the chars over to a new string for each "toLowerCase"...
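      The idea above can be illustrated with a minimal, self-contained sketch. It does not use any OpenNLP classes and is not the actual patch: the class name CharBufferNGramSketch and both method names are hypothetical. The first method cuts out and lowercases one String per n-gram; the second lowercases the input once into a single char array and hands out CharBuffer views that share that array. Note the assumption that per-char lowercasing is acceptable: Character.toLowerCase(char) does not handle supplementary characters or locale-specific length changes, so the two variants are only guaranteed to agree on simple input.

```java
import java.nio.CharBuffer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; not part of OpenNLP.
final class CharBufferNGramSketch {

    /** Baseline: each n-gram is copied into its own String and lowercased. */
    static Map<String, Integer> countWithStrings(String text, int minLen, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minLen; n <= maxLen; n++) {
            for (int start = 0; start + n <= text.length(); start++) {
                // substring() copies the chars; toLowerCase() may copy them again.
                String gram = text.substring(start, start + n).toLowerCase(Locale.ROOT);
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Alternative: lowercase once, then use CharBuffer views over one shared char[]. */
    static Map<CharSequence, Integer> countWithCharBuffers(String text, int minLen, int maxLen) {
        char[] chars = text.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toLowerCase(chars[i]); // in place, no new String
        }
        CharBuffer whole = CharBuffer.wrap(chars); // heap buffer backed by 'chars'
        Map<CharSequence, Integer> counts = new HashMap<>();
        for (int n = minLen; n <= maxLen; n++) {
            for (int start = 0; start + n <= chars.length; start++) {
                // subSequence() creates a small view object but copies no chars.
                // CharBuffer equals()/hashCode() compare the remaining content,
                // so views with equal content land in the same map entry.
                CharBuffer gram = whole.subSequence(start, start + n);
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "Hello World";
        Map<String, Integer> viaStrings = countWithStrings(text, 1, 3);
        Map<CharSequence, Integer> viaBuffers = countWithCharBuffers(text, 1, 3);

        // Compare by converting each buffer key to a String for the lookup.
        boolean same = viaStrings.size() == viaBuffers.size();
        for (Map.Entry<CharSequence, Integer> e : viaBuffers.entrySet()) {
            same &= e.getValue().equals(viaStrings.get(e.getKey().toString()));
        }
        System.out.println("Same n-gram counts: " + same);
    }
}
```

      The buffer variant still allocates one small view object per n-gram, so the saving is in char array copies rather than in object count alone. The keys also stay valid only as long as the shared char array is not modified after the views have been put into the map.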

      Note:

      • A corresponding patch / PR should be tested with/against the Evaluation suite.

      Comments welcome.

          People

            Assignee: Martin Wiesner (mawiesne)
            Reporter: Martin Wiesner (mawiesne)