[OPENNLP-1505] Reduce object creation in NGramCharModel and StringUtil - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.3.0
Component/s: Language Detector
Labels: None

Description

During a profiling session, I noticed that many tests in opennlp.tools.langdetect take quite some time to execute. Digging deeper into those tests, it quickly became obvious that StringUtil#toLowerCase() creates a new String on every call (see NGramCharModel#add(...), lines 99 to 108).

Because NGramCharModel calls this method very frequently, millions of String objects are created while building ngrams for a given input.

Aims:

Reduce object creation, and thus avoid creating millions of String objects
Improve runtime of the langdetect tests (and potentially others)

Idea:

Use (Heap)CharBuffer instead of String so that underlying char arrays can be re-used, instead of copying the chars over to a new string for each "toLowerCase"...
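
To illustrate the idea, here is a minimal sketch, not the actual patch: the class name CharBufferLowerCaseSketch and the method toLowerCaseInPlace are hypothetical and are not part of OpenNLP. It assumes a heap-backed CharBuffer obtained via CharBuffer.wrap(char[]) and lower-cases the chars directly in the backing array, so one buffer can be re-used for many ngrams.

```java
import java.nio.CharBuffer;

// Hypothetical helper for illustration only; not OpenNLP API.
final class CharBufferLowerCaseSketch {

    // Lower-cases the chars between position and limit in place.
    // No new String or char[] is allocated by this method.
    static CharBuffer toLowerCaseInPlace(CharBuffer buffer) {
        for (int i = buffer.position(); i < buffer.limit(); i++) {
            buffer.put(i, Character.toLowerCase(buffer.get(i)));
        }
        return buffer;
    }

    public static void main(String[] args) {
        // One scratch array, wrapped once and re-used for every ngram.
        char[] scratch = new char[16];
        CharBuffer buffer = CharBuffer.wrap(scratch);

        for (String gram : new String[] {"ABC", "Def"}) {
            buffer.clear();   // reset position and limit, keep the array
            buffer.put(gram); // copy the chars into the shared array
            buffer.flip();    // limit = number of chars written, position = 0
            toLowerCaseInPlace(buffer);
            System.out.println(buffer);
        }
    }
}
```

Note that this sketch converts char by char with Character.toLowerCase(char), so unlike String#toLowerCase() it ignores locale-specific rules, mappings that change the length of the text, and supplementary code points. Whether that matters for the language detector would have to be checked against the Evaluation suite. Also note that CharBuffer.wrap(CharSequence) returns a read-only buffer, which is why the sketch wraps a char array instead.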

Note:

A corresponding patch / PR should be tested with/against the Evaluation suite.

Comments welcome.

People

Assignee: Martin Wiesner

Reporter: Martin Wiesner

Votes: 0

Watchers: 2

Dates

Created: 24/Jul/23 12:38

Updated: 27/Jul/23 06:24

Resolved: 27/Jul/23 06:24