Details
- Type: Documentation
- Status: Resolved
- Priority: Trivial
- Resolution: Won't Fix
- Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.1.3, 3.2.0, 3.2.1, 3.2.2, 3.3.0
- Fix Versions: None
- Components: None
Description
The documentation is wrong regarding the memory consumption of java.lang.String:
https://spark.apache.org/docs/latest/tuning.html#memory-tuning
Internally, the source for this doc section is here:
https://github.com/apache/spark/blob/master/docs/tuning.md?plain=1#L100
* Java `String`s have about 40 bytes of overhead over the raw string data (since they store it in an array of `Char`s and keep extra data such as the length), and store each character as *two* bytes due to `String`'s internal usage of UTF-16 encoding. Thus a 10-character string can easily consume 60 bytes.
Reason: since Java 9, the JVM has optimized away the problem described in the doc.
It used to be 16 bytes of header plus the characters stored internally in a char[] encoded as UTF-16.
(Note that before JDK 9, starting with JDK 6, there was also an internal flag for the HotSpot JVM, -XX:+UseCompressedStrings, but it was not enabled by default.)
Since OpenJDK 9, with the implementation of JEP 254 ( https://openjdk.org/jeps/254 ), strings are stored internally with one byte per character when they are plain Latin-1 text, and as UTF-16 as before otherwise. There is now an extra byte field in java.lang.String, "coder", that records whether the compact Latin-1 representation is used.
This field is described here in the OpenJDK source code: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L170
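Whether a given string qualifies for the compact Latin-1 representation can be checked from the outside, without reflection, by testing whether it is encodable in ISO-8859-1 — a sketch of the same criterion the JVM applies when choosing the coder (the class name here is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class CompactStringCheck {
    // A string can use the compact LATIN1 coder only if every char fits in one
    // ISO-8859-1 (Latin-1) byte, i.e. all UTF-16 code units are < 256.
    static boolean isLatin1Representable(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(isLatin1Representable("plain ASCII")); // true
        System.out.println(isLatin1Representable("café"));        // true: é is U+00E9, still Latin-1
        System.out.println(isLatin1Representable("日本語"));       // false: stored as UTF-16
    }
}
```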
The computation for the memory size of a String used to be 40 + 2*charCount; it is now 44 + 1*charCount for Latin-1 text, and 44 + 2*charCount otherwise.
The object overhead is 44 rather than 40 + 1 for the added "byte" field because of object alignment.
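The old and new estimates can be sketched as a small helper. The constants (40 and 44) and the Latin-1 test follow this ticket's reasoning; they are rough estimates for a 64-bit HotSpot JVM, not exact measurements, and the class name is illustrative:

```java
public class StringSizeEstimate {
    // Pre-JDK-9 estimate from the Spark docs: ~40 bytes of overhead plus
    // two bytes per character (UTF-16 char[]).
    static long estimateLegacy(String s) {
        return 40 + 2L * s.length();
    }

    // JDK 9+ estimate per this ticket: ~44 bytes of overhead, plus one byte
    // per character for Latin-1 text, or two bytes per character otherwise.
    static long estimateCompact(String s) {
        boolean latin1 = s.chars().allMatch(c -> c < 256); // same criterion as the LATIN1 coder
        return 44 + (latin1 ? 1L : 2L) * s.length();
    }

    public static void main(String[] args) {
        System.out.println(estimateLegacy("0123456789"));  // 60: the doc's 10-character example
        System.out.println(estimateCompact("0123456789")); // 54: compact Latin-1 representation
    }
}
```

Note that the 10-character example in the quoted doc text (60 bytes) matches the legacy formula, while a modern JVM stores the same ASCII string in noticeably less space.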