Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Incomplete
- Affects Version/s: 2.2.2
- Fix Version/s: None
Description
I've analyzed a heap dump of the Spark History Server with jxray (www.jxray.com) and found that 42% of the heap is wasted on duplicate strings. The biggest sources of these strings are the name and value data fields of AccumulableInfo objects:
7. Duplicate Strings: overhead 42.1%

    Total strings    Unique strings    Duplicate values    Overhead
    13,732,278       729,234           354,032             867,177K (42.1%)

  Expensive data fields:

    318,421K (15.4%), 3669685 / 100% dup strings (8 unique), 3669685 dup backing arrays:
      ↖org.apache.spark.scheduler.AccumulableInfo.name
    178,994K (8.7%), 3674403 / 99% dup strings (35640 unique), 3674403 dup backing arrays:
      ↖scala.Some.x
    168,601K (8.2%), 3401960 / 92% dup strings (175826 unique), 3401960 dup backing arrays:
      ↖org.apache.spark.scheduler.AccumulableInfo.value
That is, 15.4% of the heap is wasted by AccumulableInfo.name and 8.2% is wasted by AccumulableInfo.value.
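For clarity, a duplicate string here means a distinct String object whose contents equal those of another, each copy carrying its own backing char[] array. A minimal illustration (the accumulator name below is just an example value):

    // Two strings with identical contents but separate heap objects:
    // equal by value, distinct by reference, each with its own char[].
    val a = new String("internal.metrics.executorRunTime".toCharArray)
    val b = new String("internal.metrics.executorRunTime".toCharArray)
    assert(a == b)    // same contents
    assert(!(a eq b)) // two objects: this is the duplication jxray reports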
It turns out that the problem has been partially addressed in Spark 2.3+, via the weakIntern(String) method discussed below.
However, this code has two minor problems:
- Strings for AccumulableInfo.value are not interned in the above code; only AccumulableInfo.name is.
- For interning, the weakIntern(String) method uses a Guava interner (stringInterner = Interners.newWeakInterner[String]()). This is an old-fashioned, less efficient way of interning strings: since a JDK 7 update several years ago, the built-in JVM String.intern() method has been much more efficient in terms of both CPU and memory (see the sketch after this list).
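For reference, the Guava-based interning amounts to roughly the following (a sketch based on the snippet quoted in this ticket, not the exact Spark source):

    import com.google.common.collect.{Interner, Interners}

    // Guava weak interner: keeps weak references to its canonical strings,
    // so entries can be garbage-collected once nothing else points to them.
    val stringInterner: Interner[String] = Interners.newWeakInterner[String]()

    def weakIntern(s: String): String = stringInterner.intern(s)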
It is therefore suggested to intern strings for AccumulableInfo.value as well, and to replace the Guava interner with String.intern().
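A sketch of what the suggested change could look like (the weakIntern name comes from this ticket; the simplified record type and the dedup helper are hypothetical, for illustration only):

    // Use the JVM's built-in String.intern() instead of a Guava interner.
    // Since JDK 7 the interned-string table lives on the Java heap rather
    // than in PermGen, and unreachable interned strings can be collected,
    // so intern() is safe to call on a high volume of strings.
    def weakIntern(s: String): String =
      if (s == null) null else s.intern()

    // Hypothetical call site: intern both fields, not just the name.
    case class AccInfo(name: String, value: String) // simplified stand-in

    def dedup(info: AccInfo): AccInfo =
      AccInfo(weakIntern(info.name), weakIntern(info.value))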