  1. Lucene - Core
  2. LUCENE-10677

Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale


Details

    • Type: Bug
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 9.3
    • Fix Version/s: None
    • Component/s: core/codecs
    • Lucene Fields: New

    Description

      This has the same origin as LUCENE-10676. Running a single process with thousands of fields across many indexes leads to many duplicate strings being retained as keys and values in the `attributes` map. This can amount to GBs of heap for thousands of fields across a few thousand segments. In the attached heap dump analysis, these strings account for more than half of the duplicate strings from `FieldInfo` instances (roughly 2/3; note that the field names are somewhat unusually long in this example).

      If we could deduplicate these obvious, known strings when reading `FieldInfo`, we could save GBs of heap for use cases like this.
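
      To illustrate the idea, here is a minimal sketch of deduplicating attribute keys and values through a shared pool of canonical instances. It is not taken from any patch on this issue: the class `AttributeDeduplicator` and its methods `canonical` and `dedup` are hypothetical names, and the sketch assumes the attributes arrive as a plain `Map<String, String>` after being read.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not a Lucene class: shows one way to share
// equal attribute strings across many FieldInfo-like objects.
final class AttributeDeduplicator {

    // Process-wide pool mapping each distinct string to its canonical instance.
    // Assumption: the set of distinct keys/values is small and long-lived,
    // so the pool itself never needs to be evicted.
    private static final ConcurrentHashMap<String, String> POOL = new ConcurrentHashMap<>();

    // Returns the canonical instance for s, registering s if it is new.
    static String canonical(String s) {
        if (s == null) {
            return null;
        }
        String previous = POOL.putIfAbsent(s, s);
        return previous != null ? previous : s;
    }

    // Builds a copy of the map whose keys and values are canonical instances.
    static Map<String, String> dedup(Map<String, String> attributes) {
        Map<String, String> result = new HashMap<>(attributes.size() * 2);
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            result.put(canonical(e.getKey()), canonical(e.getValue()));
        }
        return result;
    }
}
```

      With something like this applied at read time, two segments carrying equal attribute strings would reference the same `String` objects instead of separate copies. `String.intern()` would be an alternative to the explicit pool; which one is preferable here is an open design question.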

       

      Attachments

        1. lucene_duplicate_fields.png
          162 kB
          Armin Braun

          People

            Assignee: Unassigned
            Reporter: Armin Braun (original-brownbear)