Details
-
Improvement
-
Status: Closed
-
Major
-
Resolution: Fixed
-
2.2.0
-
None
Description
During a profiling session, I noticed that many tests in opennlp.tools.langdetect take quite some time to execute. Digging deeper into those tests, it quickly became obvious that StringUtil#toLowerCase() creates a new String on every call (see NGramCharModel#add(...), lines 99 to 108).
Because NGramCharModel calls this method very frequently, millions of String objects are created while building n-grams for a given input.
Aims:
- Reduce object creation, in particular the creation of millions of String objects
- Improve runtime of the langdetect tests (and potentially others)
Idea:
- Use (Heap)CharBuffer instead of String so that underlying char arrays can be re-used, instead of copying the chars over to a new string for each "toLowerCase"...
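A rough sketch of the idea, not the actual patch: the class and method names below are hypothetical and are not part of OpenNLP. It assumes the input is wrapped once in a heap-backed CharBuffer, lower-cased in place, and that n-grams are then taken as views sharing the same char array. Note that lower-casing char by char can differ from String#toLowerCase() for some Unicode cases (locale-specific mappings, supplementary code points), so this would need checking against the evaluation suite.

```java
import java.nio.CharBuffer;

// Hypothetical helper for illustration only; not an OpenNLP class.
final class CharBufferLowerCaseSketch {

  // Lower-cases the readable region (position..limit) of a writable,
  // heap-backed CharBuffer in place. No new String or char[] is allocated.
  static CharBuffer toLowerCaseInPlace(CharBuffer buffer) {
    for (int i = buffer.position(); i < buffer.limit(); i++) {
      // Absolute get/put: neither call moves the buffer's position.
      buffer.put(i, Character.toLowerCase(buffer.get(i)));
    }
    return buffer;
  }

  public static void main(String[] args) {
    // Wrap the input once; the buffer is backed by this single char array.
    CharBuffer text = CharBuffer.wrap("Hello World".toCharArray());
    toLowerCaseInPlace(text);

    // Each n-gram is a lightweight view onto the same array,
    // so no characters are copied per n-gram.
    int n = 3;
    for (int i = 0; i + n <= text.length(); i++) {
      CharBuffer gram = text.subSequence(i, i + n);
      System.out.println(gram);
    }
  }
}
```

Each subSequence call still allocates a small view object, but the underlying characters are shared instead of being copied into a new String per n-gram.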
Note:
- A corresponding patch / PR should be tested with/against the Evaluation suite.
Comments welcome.