  1. Spark
  2. SPARK-24827

Some memory waste in History Server by strings in AccumulableInfo objects

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.2.2
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

      Description

      I've analyzed a heap dump of Spark History Server with jxray (www.jxray.com) and found that 42% of the heap is wasted due to duplicate strings. The biggest sources of such strings are the name and value data fields of AccumulableInfo objects:

      7. Duplicate Strings:  overhead 42.1% 
      
        Total strings   Unique strings   Duplicate values   Overhead
        13,732,278      729,234          354,032            867,177K (42.1%)
      
      Expensive data fields:
      
      
      318,421K (15.4%), 3669685 / 100% dup strings (8 unique), 3669685 dup backing arrays:
      
       ↖org.apache.spark.scheduler.AccumulableInfo.name
      
      178,994K (8.7%), 3674403 / 99% dup strings (35640 unique), 3674403 dup backing arrays:
      
       ↖scala.Some.x
      
      168,601K (8.2%), 3401960 / 92% dup strings (175826 unique), 3401960 dup backing arrays:
      
       ↖org.apache.spark.scheduler.AccumulableInfo.value

      That is, 15.4% of the heap is wasted by AccumulableInfo.name and 8.2% is wasted by AccumulableInfo.value.
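
      To illustrate what "duplicate strings" means here, the following minimal sketch counts distinct String objects by identity before and after interning. The accumulator name, the object name DuplicateStringsDemo and the helper distinctObjects are illustrative assumptions, not taken from the heap dump or from Spark:

```scala
import java.util.{Collections, IdentityHashMap}

object DuplicateStringsDemo {
  // Counts distinct String *objects* (by reference identity), not distinct values.
  def distinctObjects(strings: Seq[String]): Int = {
    val seen = Collections.newSetFromMap(new IdentityHashMap[String, java.lang.Boolean]())
    strings.foreach(s => seen.add(s))
    seen.size
  }

  // Simulates deserialization: every element is a fresh String object
  // holding the same (illustrative) accumulator name.
  def duplicatedNames(n: Int): Seq[String] =
    Seq.fill(n)(new String("illustrative.accumulator.name"))

  def main(args: Array[String]): Unit = {
    val names = duplicatedNames(1000)
    println(s"objects before interning: ${distinctObjects(names)}")
    println(s"objects after interning:  ${distinctObjects(names.map(_.intern()))}")
  }
}
```

      Each copy that is not interned carries its own object header and backing array, which is the overhead that the report above attributes to the name and value fields.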

      It turns out that the problem has been partially addressed in Spark 2.3+, e.g. in:

      https://github.com/apache/spark/blob/b045315e5d87b7ea3588436053aaa4d5a7bd103f/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L590

      However, this code has two minor problems:

      1. Strings for AccumulableInfo.value are not interned in the above code, only AccumulableInfo.name.
      2. For interning, the code in the weakIntern(String) method uses a Guava interner (stringInterner = Interners.newWeakInterner[String]()). This is an old-fashioned, less efficient way of interning strings. Since a JDK 7 update released some 3-4 years ago, the built-in JVM String.intern() method has been much more efficient, in terms of both CPU and memory.

      It is therefore suggested to add interning for value and replace the Guava interner with String.intern().
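
      A minimal sketch of what the suggested change could look like, assuming a simplified stand-in class. AccInfo, AccInfoInterning, intern and newAccInfo are hypothetical names for illustration only; the real AccumulableInfo has more fields, and its value is not necessarily a plain String:

```scala
// Hypothetical, simplified stand-in for Spark's AccumulableInfo; not the real class.
final case class AccInfo(id: Long, name: String, value: String)

object AccInfoInterning {
  // Null-safe wrapper around the JVM's built-in String.intern().
  def intern(s: String): String = if (s == null) null else s.intern()

  // Interns both name and value, so equal strings share a single object.
  def newAccInfo(id: Long, name: String, value: String): AccInfo =
    AccInfo(id, intern(name), intern(value))

  def main(args: Array[String]): Unit = {
    // new String(...) forces distinct objects, as deserialization would.
    val a = newAccInfo(1L, new String("some.name"), new String("42"))
    val b = newAccInfo(2L, new String("some.name"), new String("42"))
    println(a.name eq b.name)   // reference equality on the interned names
    println(a.value eq b.value) // reference equality on the interned values
  }
}
```

      Whether interning value pays off in practice depends on how many distinct values occur. The heap dump above shows 92% duplicates for that field, which suggests it would help in this case.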

            People

            • Assignee: Unassigned
            • Reporter: Misha Dmitriev (misha@cloudera.com)
            • Votes: 1
            • Watchers: 2