Details
- Type: Documentation
- Status: Resolved
- Priority: Trivial
- Resolution: Won't Fix
- Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.1.3, 3.2.0, 3.2.1, 3.2.2, 3.3.0
- Fix Versions: None
- Components: None
Description
The documentation is wrong regarding the memory consumption of java.lang.String:
https://spark.apache.org/docs/latest/tuning.html#memory-tuning
Internally, the source for this doc section is here:
https://github.com/apache/spark/blob/master/docs/tuning.md?plain=1#L100
* Java `String`s have about 40 bytes of overhead over the raw string data (since they store it in an array of `Char`s and keep extra data such as the length), and store each character as *two* bytes due to `String`'s internal usage of UTF-16 encoding. Thus a 10-character string can easily consume 60 bytes.
Reason: since Java 9, the JVM has optimized away the problem described in the doc.
It used to be 16 bytes of header plus the characters stored internally in a char[] encoded as UTF-16.
(Note that before JDK 9, starting with JDK 6, there was also an internal flag for the HotSpot JVM, -XX:+UseCompressedStrings, but it was not enabled by default.)
Since OpenJDK 9, with the implementation of JEP 254 ( https://openjdk.org/jeps/254 ), strings are stored internally with one byte per character when they are plain Latin-1 text, and as UTF-16 as before otherwise. There is now an extra byte field in java.lang.String, "coder", that records whether the compact Latin-1 representation is used.
This field is described here in the OpenJDK source code: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L170
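Whether a given string qualifies for the compact Latin-1 representation can be checked from the outside, without reflection, by testing whether it is encodable in ISO-8859-1 — a sketch of the same criterion the JVM applies when choosing the coder (the class name here is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class CompactStringCheck {
    // A string can use the compact LATIN1 coder only if every char fits in one
    // ISO-8859-1 (Latin-1) byte, i.e. all UTF-16 code units are < 256.
    static boolean isLatin1Representable(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(isLatin1Representable("plain ASCII")); // true
        System.out.println(isLatin1Representable("café"));        // true: é is U+00E9, still Latin-1
        System.out.println(isLatin1Representable("日本語"));       // false: stored as UTF-16
    }
}
```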
The computation for the memory size of a String used to be 40 + 2*charCount; it is now 44 + 1*charCount for Latin-1 text, and 44 + 2*charCount otherwise.
The object overhead is 44 rather than 40 + 1 for the added "byte" field because of object alignment.
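The old and new estimates can be sketched as a small helper. The constants (40 and 44) and the Latin-1 test follow this ticket's reasoning; they are rough estimates for a 64-bit HotSpot JVM, not exact measurements, and the class name is illustrative:

```java
public class StringSizeEstimate {
    // Pre-JDK-9 estimate from the Spark docs: ~40 bytes of overhead plus
    // two bytes per character (UTF-16 char[]).
    static long estimateLegacy(String s) {
        return 40 + 2L * s.length();
    }

    // JDK 9+ estimate per this ticket: ~44 bytes of overhead, plus one byte
    // per character for Latin-1 text, or two bytes per character otherwise.
    static long estimateCompact(String s) {
        boolean latin1 = s.chars().allMatch(c -> c < 256); // same criterion as the LATIN1 coder
        return 44 + (latin1 ? 1L : 2L) * s.length();
    }

    public static void main(String[] args) {
        System.out.println(estimateLegacy("0123456789"));  // 60: the doc's 10-character example
        System.out.println(estimateCompact("0123456789")); // 54: compact Latin-1 representation
    }
}
```

Note that the 10-character example in the quoted doc text (60 bytes) matches the legacy formula, while a modern JVM stores the same ASCII string in noticeably less space.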