I've analyzed a heap dump of the Spark History Server with jxray (www.jxray.com) and found that 42% of the heap is wasted on duplicate strings. The biggest sources of such strings are the name and value data fields of AccumulableInfo objects:
That is, 15.4% of the heap is wasted by AccumulableInfo.name and 8.2% is wasted by AccumulableInfo.value.
It turns out that the problem has been partially addressed in Spark 2.3+, e.g.
However, this code has two minor problems:
- Strings for AccumulableInfo.value are not interned in the above code, only AccumulableInfo.name.
- For interning, the weakIntern(String) method uses a Guava interner (stringInterner = Interners.newWeakInterner[String]()). This is an old-fashioned, less efficient way of interning strings. Since a JDK 7 update released some 3-4 years ago, the built-in JVM String.intern() method has been much more efficient than a Guava interner, in terms of both CPU and memory.
It is therefore suggested to also intern AccumulableInfo.value and to replace the Guava interner with String.intern().
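The effect of String.intern() can be illustrated with a small standalone snippet (the accumulator name below is just an illustrative value, not taken from the dump): two equal strings that arrive as distinct heap objects, as AccumulableInfo fields do after event-log parsing, collapse into one canonical copy once interned.

```java
public class InternDemo {
    public static void main(String[] args) {
        // Simulate two equal strings parsed independently, so they are
        // distinct objects on the heap (new String() forces a fresh copy).
        String a = new String("internal.metrics.shuffle.read.recordsRead");
        String b = new String("internal.metrics.shuffle.read.recordsRead");

        // Two copies: reference equality fails even though contents match.
        System.out.println(a == b);

        // After interning, both references point to the single canonical
        // instance in the JVM's native string table, so the duplicate
        // becomes garbage and its memory is reclaimed.
        System.out.println(a.intern() == b.intern());
    }
}
```

Since JDK 7 the interned strings live in the native string table rather than PermGen, which is what makes String.intern() a practical replacement for a user-level Guava interner here.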