Lucene - Core / LUCENE-10171

Caching issue on dictionary-based OpenNLPLemmatizerFilterFactory

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 9.0, 7.7.3, 8.10
    • Fix Version/s: 9.1
    • Component/s: modules/analysis
    • Labels: None
    • Yes

    Description

      When a lemmas.txt dictionary file is provided, OpenNLPLemmatizerFilterFactory internally caches only the string form of the dictionary, not the DictionaryLemmatizer object. As a result, the dictionary is re-parsed and a new DictionaryLemmatizer object is created every time OpenNLPLemmatizerFilterFactory.create() is called.

      In our case, with a large lemmas.txt file (5MB) and the OpenNLPLemmatizerFilter used in many fields and across multiple collections (we use Solr), we had several seemingly random OOM errors and generally high server load due to GC activity. Heap dump analysis showed a few thousand DictionaryLemmatizer instances of around 80MB each.

      By caching the DictionaryLemmatizer instead of the String, we were able to resolve these issues. I will attach a PR for review; please let me know if you have any comments.
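      The difference between the two caching strategies can be sketched roughly as follows. This is not the Lucene or OpenNLP code: ParsedDictionary, createFromRawCache, createFromParsedCache, and both cache maps are hypothetical stand-ins, and the tab-separated word/tag/lemma line format is assumed for illustration only. Sharing one parsed instance also assumes that lookups on it are read-only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class LemmatizerCacheSketch {

  /** Hypothetical stand-in for an expensive-to-build parsed dictionary. */
  public static final class ParsedDictionary {
    /** Counts how many times a dictionary has been parsed. */
    public static final AtomicInteger PARSE_COUNT = new AtomicInteger();

    private final Map<String, String> lemmas = new HashMap<>();

    public ParsedDictionary(String rawText) {
      PARSE_COUNT.incrementAndGet();
      // Assumed format: word <TAB> posTag <TAB> lemma, one entry per line.
      for (String line : rawText.split("\n")) {
        String[] cols = line.split("\t");
        if (cols.length == 3) {
          lemmas.put(cols[0] + "\t" + cols[1], cols[2]);
        }
      }
    }

    /** Returns the lemma for (word, posTag), or the word itself if unknown. */
    public String lemma(String word, String posTag) {
      return lemmas.getOrDefault(word + "\t" + posTag, word);
    }
  }

  // Strategy described as the problem: only the raw text is cached.
  public static final Map<String, String> RAW_CACHE = new ConcurrentHashMap<>();

  // Strategy described as the fix: the parsed object itself is cached.
  public static final Map<String, ParsedDictionary> PARSED_CACHE = new ConcurrentHashMap<>();

  /** Re-parses the cached text on every call, allocating a new object each time. */
  public static ParsedDictionary createFromRawCache(String name, String rawText) {
    String cached = RAW_CACHE.computeIfAbsent(name, k -> rawText);
    return new ParsedDictionary(cached);
  }

  /** Parses once per dictionary name and returns the shared instance afterwards. */
  public static ParsedDictionary createFromParsedCache(String name, String rawText) {
    return PARSED_CACHE.computeIfAbsent(name, k -> new ParsedDictionary(rawText));
  }

  public static void main(String[] args) {
    String raw = "walked\tVBD\twalk\nmice\tNNS\tmouse\n";

    int start = ParsedDictionary.PARSE_COUNT.get();
    for (int i = 0; i < 3; i++) {
      createFromRawCache("lemmas.txt", raw);
    }
    System.out.println("parses with raw-text cache: "
        + (ParsedDictionary.PARSE_COUNT.get() - start));

    start = ParsedDictionary.PARSE_COUNT.get();
    for (int i = 0; i < 3; i++) {
      createFromParsedCache("lemmas.txt", raw);
    }
    System.out.println("parses with parsed-object cache: "
        + (ParsedDictionary.PARSE_COUNT.get() - start));
  }
}
```

      With the raw-text cache, the number of live parsed objects grows with the number of create() calls; with the parsed-object cache it is bounded by the number of distinct dictionaries.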

      Thanks!

            People

              Assignee: magibney Michael Gibney
              Reporter: spyk Spyros Kapnissis
              Votes: 0
              Watchers: 3

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 4h 10m