Spark / SPARK-24827

Some memory waste in History Server by strings in AccumulableInfo objects


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.2
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      I've analyzed a heap dump of Spark History Server with jxray (www.jxray.com) and found that 42% of the heap is wasted due to duplicate strings. The biggest sources of such strings are the name and value data fields of AccumulableInfo objects:

      7. Duplicate Strings:  overhead 42.1% 
      
        Total strings   Unique strings   Duplicate values   Overhead
        13,732,278      729,234          354,032            867,177K (42.1%)
      
      Expensive data fields:
      
      
      318,421K (15.4%), 3669685 / 100% dup strings (8 unique), 3669685 dup backing arrays:
      
       ↖org.apache.spark.scheduler.AccumulableInfo.name
      
      178,994K (8.7%), 3674403 / 99% dup strings (35640 unique), 3674403 dup backing arrays:
      
       ↖scala.Some.x
      
      168,601K (8.2%), 3401960 / 92% dup strings (175826 unique), 3401960 dup backing arrays:
      
       ↖org.apache.spark.scheduler.AccumulableInfo.value

      That is, 15.4% of the heap is wasted by AccumulableInfo.name and 8.2% is wasted by AccumulableInfo.value.
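
To make the notion of a "duplicate string" concrete, here is a minimal Java sketch. It is not taken from Spark; the class `DuplicateStringsDemo`, the helper `parsed`, and the sample metric name are hypothetical. It assumes that each parsed event produces a fresh String object, so equal names occupy separate heap objects until they are interned.

```java
public class DuplicateStringsDemo {
    // Hypothetical stand-in for a parser: always allocates a new String
    // object, even when the contents match an earlier string.
    static String parsed(String raw) {
        return new String(raw.toCharArray());
    }

    public static void main(String[] args) {
        String a = parsed("example.metric.name");
        String b = parsed("example.metric.name");

        // Same contents, but two separate objects on the heap.
        System.out.println(a.equals(b)); // true
        System.out.println(a == b);      // false

        // After interning, both references point to one canonical object,
        // so the duplicate copy becomes garbage.
        System.out.println(a.intern() == b.intern()); // true
    }
}
```

With only 8 unique values among 3,669,685 name strings, nearly every name object in the dump above is a redundant copy of this kind.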

      It turns out that the problem has been partially addressed in Spark 2.3+, for example here:

      https://github.com/apache/spark/blob/b045315e5d87b7ea3588436053aaa4d5a7bd103f/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L590

      However, this code has two minor problems:

      1. Only the strings for AccumulableInfo.name are interned in the above code; the strings for AccumulableInfo.value are not.
      2. For interning, the weakIntern(String) method uses a Guava interner (stringInterner = Interners.newWeakInterner[String]()). This is an old-fashioned, less efficient way of interning strings. Since a JDK 7 update released some 3-4 years ago, the built-in JVM String.intern() method has been much more efficient, in terms of both CPU and memory.

      It is therefore suggested to add interning for value and replace the Guava interner with String.intern().
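
The suggestion could look roughly like the following Java sketch. It is an illustration only, not Spark's actual code: the class `InternedAccumulable` and its two plain String fields are hypothetical simplifications of AccumulableInfo, and the sketch assumes both fields are non-null strings.

```java
public class InternedAccumulable {
    final String name;
    final String value;

    // Intern both fields on construction with the built-in String.intern(),
    // so equal names and equal values share one canonical String each.
    InternedAccumulable(String name, String value) {
        this.name = name.intern();
        this.value = value.intern();
    }

    public static void main(String[] args) {
        // new String(...) forces distinct objects, as parsing would.
        InternedAccumulable x =
            new InternedAccumulable(new String("metric"), new String("42"));
        InternedAccumulable y =
            new InternedAccumulable(new String("metric"), new String("42"));

        System.out.println(x.name == y.name);   // true
        System.out.println(x.value == y.value); // true
    }
}
```

Whether interning value pays off depends on how repetitive the values are; in the dump above, 92% of the value strings were duplicates.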

          People

            Assignee: Unassigned
            Reporter: Misha Dmitriev (misha@cloudera.com)
            Votes: 1
            Watchers: 3