[LUCENE-10676] FieldInfo#name contributes significantly to heap usage at scale - ASF JIRA

Details

Type: Bug
Status: Patch Available
Priority: Minor
Resolution: Unresolved
Affects Version/s: 9.3
Fix Version/s: None
Component/s: core/codecs
Labels:
- Master
- heap
- membership
- scalability
Environment:

Seen in Lucene 9.3.0 running on Linux with JDK 18, but the problem appears to be independent of the environment.

Lucene Fields:

New, Patch Available

Description

We encountered an Elasticsearch user with high heap usage, a significant proportion of which was down to the contents of `FieldInfo#name`.

This user was certainly pushing some scalability boundaries: this single process had thousands of active Lucene indices, many with 10k+ fields, and many indices had hundreds of segments due to an excess of flushes, so in total they had an enormous number of `FieldInfo` instances. Still, the bulk of the heap usage was just field names, and the total number of distinct field names was fairly small. That's pretty common, especially for time-based data like logs. Some kind of interning or deduplication of these strings would have reduced their heap usage by many GBs.

Is there a way we could deduplicate these strings? Deduplicating them across segments within each index would already have helped, but ideally we'd like to deduplicate them across indices too.
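To make the request concrete, here is a minimal sketch of the kind of deduplication being asked about. It assumes a process-wide pool that hands back one canonical `String` per distinct field name. `FieldNamePool` and `dedup` are hypothetical names for illustration only; they are not existing Lucene APIs, and this is not a proposal for how the patch should be implemented.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper, not part of Lucene: a pool of canonical field-name strings. */
final class FieldNamePool {
    // Maps each distinct name to the single instance that all callers should share.
    private static final ConcurrentHashMap<String, String> POOL = new ConcurrentHashMap<>();

    private FieldNamePool() {}

    /** Returns the canonical instance equal to {@code name}, registering it on first sight. */
    static String dedup(String name) {
        String existing = POOL.putIfAbsent(name, name);
        return existing != null ? existing : name;
    }

    /** Number of distinct names currently held by the pool. */
    static int size() {
        return POOL.size();
    }

    public static void main(String[] args) {
        // Simulate the same field name being decoded separately for two segments.
        String a = new String("@timestamp".toCharArray());
        String b = new String("@timestamp".toCharArray());
        System.out.println(a == b);               // false: two separate copies on the heap
        System.out.println(dedup(a) == dedup(b)); // true: both resolve to one shared instance
    }
}
```

A static pool like this would deduplicate across segments and across indices, since every reader in the process shares it. The obvious trade-off is that this sketch holds strong references and never releases a name, so a real implementation would probably need weak references or some per-index scoping. `String.intern()` is the JVM's built-in alternative with similar semantics.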

Attachments

image-2022-08-08-13-23-37-050.png
08/Aug/22 11:23
193 kB
Armin Braun

People

Assignee: Unassigned

Reporter: David Turner

Votes: 1

Watchers: 3

Dates

Created: 08/Aug/22 09:39

Updated: 08/Sep/22 05:38