Details
Type: Bug
Status: Patch Available
Priority: Minor
Resolution: Unresolved
Affects Version/s: 9.3
Fix Version/s: None
Environment: Seen in Lucene 9.3.0 running on Linux using JDK18 but seems independent of environment.
Lucene Fields: New, Patch Available
Description
We encountered an Elasticsearch user with high heap usage, a significant proportion of which was down to the contents of `FieldInfo#name`.
This user was certainly pushing some scalability boundaries: this single process had thousands of active Lucene indices, many with 10k+ fields, and many indices had hundreds of segments due to an excess of flushes, so in total they had an enormous number of `FieldInfo` instances. Still, the bulk of the heap usage was just field names, and the total number of distinct field names was fairly small. That's pretty common, especially for time-based data like logs. Some kind of interning or deduplication of these strings would have reduced their heap usage by many GBs.
Is there a way we could deduplicate these strings? Deduplicating them across segments within each index would already have helped, but ideally we'd like to deduplicate them across indices too.
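To make the idea concrete, here is a minimal sketch of the kind of deduplication being asked about. It is not Lucene code: `FieldNameDeduplicator` and its methods are hypothetical names invented for illustration, and the sketch assumes a single process-wide pool shared by every index, which is what deduplicating across indices would need. It also holds strong references, so names are never released; a real implementation would probably want weak references or some other eviction scheme.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical helper (not part of Lucene): returns one canonical String
 * instance per distinct field name, so equal names share a single object.
 */
final class FieldNameDeduplicator {
  // Maps each distinct name to its canonical instance.
  // Assumption: entries are held strongly and never evicted.
  private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();

  /** Returns the canonical instance equal to {@code name}. */
  String dedup(String name) {
    // putIfAbsent returns the previously stored instance, or null if
    // this call stored the given one.
    String existing = pool.putIfAbsent(name, name);
    return existing != null ? existing : name;
  }

  /** Number of distinct names currently pooled. */
  int size() {
    return pool.size();
  }
}
```

With something like this, the code that reads field infos for a segment would pass each name through `dedup` before storing it, so thousands of segments naming the same field would retain one `String` rather than one per segment. `String#intern` is a built-in alternative with similar semantics, though it uses the JVM-wide string table rather than a pool the application controls.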