Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 9.0, 7.7.3, 8.10
- None
- Yes
Description
When a lemmas.txt dictionary file is provided, OpenNLPLemmatizerFilterFactory internally caches only the string form of the dictionary, not the DictionaryLemmatizer object. As a result, a new DictionaryLemmatizer is parsed and created every time OpenNLPLemmatizerFilterFactory.create() is called.
In our case, with a large lemmas.txt file (5MB) and the OpenNLPLemmatizerFilter used in many fields across multiple collections (we use Solr), we hit several seemingly random OOM errors and generally high server load due to GC activity. Heap dump analysis showed a few thousand DictionaryLemmatizer instances of around 80MB each.
By switching the cache to hold the DictionaryLemmatizer instead of the String, we were able to resolve these issues. I will attach a PR for review; please let me know if you have any comments.
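To illustrate the idea, here is a minimal sketch of caching the parsed object rather than the raw text. It is not the code from the PR: ParsedDictionary and DictionaryCache are hypothetical stand-ins (ParsedDictionary plays the role of DictionaryLemmatizer), and the tab-separated word/tag/lemma layout is assumed for the example. The parse counter exists only to show that parsing happens once per dictionary name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an expensive parsed dictionary (e.g. DictionaryLemmatizer). */
final class ParsedDictionary {
    /** Counts constructor calls so the effect of caching can be observed. */
    static final AtomicInteger PARSE_COUNT = new AtomicInteger();

    private final Map<String, String> lemmas = new HashMap<>();

    /** Parses raw text; assumed layout is one "word<TAB>tag<TAB>lemma" entry per line. */
    ParsedDictionary(String raw) {
        PARSE_COUNT.incrementAndGet();
        for (String line : raw.split("\n")) {
            String[] cols = line.split("\t");
            if (cols.length == 3) {
                lemmas.put(cols[0] + "\t" + cols[1], cols[2]);
            }
        }
    }

    /** Returns the lemma for (word, tag), or the word itself when no entry exists. */
    String lemma(String word, String tag) {
        return lemmas.getOrDefault(word + "\t" + tag, word);
    }
}

/** Hypothetical cache keyed by dictionary name that stores the parsed object, not the String. */
final class DictionaryCache {
    private static final Map<String, ParsedDictionary> CACHE = new ConcurrentHashMap<>();

    private DictionaryCache() {}

    /** Parses the raw text only on the first request for a given name; later calls reuse the instance. */
    static ParsedDictionary get(String name, String raw) {
        return CACHE.computeIfAbsent(name, key -> new ParsedDictionary(raw));
    }
}
```

With this shape, every create()-style call that asks for the same dictionary name receives the same instance, so memory use no longer grows with the number of fields or collections that use the filter. This only works if the parsed object is safe to share between callers, which the sketch assumes.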
Thanks!