Spark / SPARK-40642

Wrong doc on memory tuning regarding String object memory size, which changed in Java 9 and later


Details

    • Type: Documentation
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Won't Fix
    • Affects Version/s: 3.1.0, 3.1.1, 3.1.2, 3.2.0, 3.1.3, 3.2.1, 3.3.0, 3.2.2
    • Fix Version/s: None
    • Component/s: Documentation
    • Labels: None

    Description

      The documentation is wrong regarding memory consumption of java.lang.String
      https://spark.apache.org/docs/latest/tuning.html#memory-tuning

      The source for this doc section is here:
      https://github.com/apache/spark/blob/master/docs/tuning.md?plain=1#L100

      * Java `String`s have about 40 bytes of overhead over the raw string data (since they store it in an
        array of `Char`s and keep extra data such as the length), and store each character
        as *two* bytes due to `String`'s internal usage of UTF-16 encoding. Thus a 10-character string can
        easily consume 60 bytes.
      

      Reason: since Java 9, the JDK has optimized the problem described in the doc.
      Before that, a String had 16 bytes of header and stored its characters internally as UTF-16 chars.

      Note that before JDK 9 (since JDK 6) the HotSpot JVM also had an internal flag, -XX:+UseCompressedStrings, but it was not enabled by default.

      Since OpenJDK 9, with the implementation of JEP 254 ( https://openjdk.org/jeps/254 ), Strings are stored internally as Latin-1 (one byte per character) when they contain only Latin-1 text, and as UTF-16 otherwise, as before. There is now an extra byte field, "coder", in class java.lang.String that records which of the two encodings is in use.
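
To make the Latin-1 condition concrete, here is a minimal sketch of the check that decides between one and two bytes per character. The class and method names (CompactStringCheck, fitsLatin1, payloadBytes) are hypothetical and written for illustration; this is not the JDK's internal code, and it assumes compact strings are enabled, which is the default.

```java
final class CompactStringCheck {
    // True when every char is <= 0xFF, i.e. representable in Latin-1.
    // Hypothetical helper: mirrors the condition described in JEP 254,
    // not the JDK's actual implementation.
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Bytes of character data only (no object or array headers):
    // 1 per char for Latin-1 text, 2 per char otherwise.
    static int payloadBytes(String s) {
        return fitsLatin1(s) ? s.length() : 2 * s.length();
    }

    public static void main(String[] args) {
        System.out.println(payloadBytes("0123456789"));         // 10
        System.out.println(payloadBytes("caf\u00e9"));          // 4 (U+00E9 fits in Latin-1)
        System.out.println(payloadBytes("\u65e5\u672c\u8a9e")); // 6 (3 chars, UTF-16)
    }
}
```

A single character above U+00FF is enough to switch the whole string to two bytes per character.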

      This field is described in the OpenJDK source code: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/String.java#L170
      The memory size of a String used to be "40+2*charCount". It is now "44+1*charCount" for Latin-1 text and "44+2*charCount" for non-Latin-1 text.

      The object overhead is 44 because of alignment, not 40+1 for the added one-byte field.
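
As a worked example of the arithmetic above, the following sketch evaluates the formulas quoted in this issue. The class and method names are hypothetical, and the constants 40 and 44 are the reporter's figures rather than measured values; real sizes depend on the JVM, its pointer-compression settings, and object alignment.

```java
final class StringSizeEstimate {
    // Overheads as quoted in this issue (assumptions, not measurements).
    static final long OVERHEAD_BEFORE_JDK9 = 40;
    static final long OVERHEAD_SINCE_JDK9 = 44;

    // Before JDK 9: always two bytes per character.
    static long estimateBeforeJdk9(int charCount) {
        return OVERHEAD_BEFORE_JDK9 + 2L * charCount;
    }

    // Since JDK 9: one byte per character for Latin-1 text, two otherwise.
    static long estimateSinceJdk9(int charCount, boolean latin1) {
        long bytesPerChar = latin1 ? 1 : 2;
        return OVERHEAD_SINCE_JDK9 + bytesPerChar * charCount;
    }

    public static void main(String[] args) {
        int n = 10; // the 10-character string from the tuning guide
        System.out.println(estimateBeforeJdk9(n));       // 60
        System.out.println(estimateSinceJdk9(n, true));  // 54
        System.out.println(estimateSinceJdk9(n, false)); // 64
    }
}
```

With these figures, the Latin-1 form is the smaller of the two once a string has more than four characters; the fixed overhead dominates for shorter strings.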


          People

            Assignee: Unassigned
            Reporter: Arnaud Nauwynck (arnaud.nauwynck)
            Votes: 0
            Watchers: 2
