SPARK-12511

streaming driver with checkpointing unable to finalize leading to OOM


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.2, 1.6.0
    • Fix Version/s: 1.6.1, 2.0.0
    • Component/s: DStreams, PySpark
    • Labels: None
    • Environment:
      pyspark 1.5.2
      yarn 2.6.0
      python 2.6
      centos 6.5
      openjdk 1.8.0

    Description

      A Spark Streaming application configured with checkpointing fills the driver's heap with ZipFileInputStream instances, apparently because spark-assembly.jar (and potentially others, for example snappy-java.jar) is repeatedly referenced (loaded?). The Java Finalizer can't finalize these ZipFileInputStream instances, so they eventually take up the whole heap and the driver crashes with an OOM.

      Steps to reproduce:

      • Submit the attached bug.py to Spark
      • Leave it running and monitor the heap of the driver's Java process
        • a heap histogram shows mainly a growing number of byte array instances (here the accumulated zip payload of the jar references):
           num     #instances         #bytes  class name
          ----------------------------------------------
             1:         32653       32735296  [B
             2:         48000        5135816  [C
             3:            41        1344144  [Lscala.concurrent.forkjoin.ForkJoinTask;
             4:         11362        1261816  java.lang.Class
             5:         47054        1129296  java.lang.String
             6:         25460        1018400  java.lang.ref.Finalizer
             7:          9802         789400  [Ljava.lang.Object;
          
        • with VisualVM you can see:
          • an increasing number of objects pending finalization
          • an increasing number of ZipFileInputStream instances related to spark-assembly.jar and referenced by Finalizer
      • Depending on the heap size and running time, this leads to a driver OOM crash
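
To make the monitoring step repeatable, the histogram can be parsed and compared between samples. The following is a minimal sketch, assuming the table above is in the format printed by `jmap -histo` (the report does not say which tool produced it). The names `parse_histogram` and `growth` are hypothetical helpers, not part of Spark or the JDK; `SAMPLE` is the histogram from this report.

```python
import re

# One histogram row: rank, instance count, total bytes, class name.
ROW = re.compile(r"^\s*(\d+):\s+(\d+)\s+(\d+)\s+(\S+)")

# The histogram quoted in this report.
SAMPLE = """
 num     #instances         #bytes  class name
----------------------------------------------
   1:         32653       32735296  [B
   2:         48000        5135816  [C
   3:            41        1344144  [Lscala.concurrent.forkjoin.ForkJoinTask;
   4:         11362        1261816  java.lang.Class
   5:         47054        1129296  java.lang.String
   6:         25460        1018400  java.lang.ref.Finalizer
   7:          9802         789400  [Ljava.lang.Object;
"""


def parse_histogram(text):
    """Return {class_name: (instances, bytes)} for every data row in text."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # header and separator lines do not match and are skipped
            _, instances, size, name = match.groups()
            rows[name] = (int(instances), int(size))
    return rows


def growth(before, after, class_name):
    """Change in instance count for one class between two parsed histograms."""
    missing = (0, 0)
    return after.get(class_name, missing)[0] - before.get(class_name, missing)[0]
```

Taking two histograms some minutes apart and calling `growth(first, second, "java.lang.ref.Finalizer")` gives the number of additional Finalizer references; a value that keeps rising across samples matches the behaviour described above.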

      Comments

      • The bug.py is a lightweight proof of the problem. In production the effect is quite rapid: within a few hours it eats gigabytes of heap and kills the app.
      • If the same bug.py is run without checkpointing, there is no issue whatsoever.
      • Not sure whether it is specific to PySpark.
      • bug.py uses the socketTextStream input, but the problem seems to be independent of the input type (in production I have the same problem with a Kafka direct stream, and I have seen it even with textFileStream).
      • It happens even if the input stream doesn't produce any data.
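
The attached bug.py is not reproduced on this page. As an illustration of the comparison in the comments above (same application, with and without checkpointing), here is a minimal sketch of what such a script might look like. It is an assumption, not the attachment: the flag names, the app name, the 10-second batch interval, and the helpers `parse_args`, `create_context` and `run` are all hypothetical. Only `SparkContext`, `StreamingContext`, `getOrCreate`, `checkpoint`, `socketTextStream`, `count`, `pprint`, `start` and `awaitTermination` are PySpark APIs.

```python
import argparse


def parse_args(argv):
    """Parse the flags of this hypothetical reproduction script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--checkpoint", default=None,
                        help="checkpoint directory; omit to run without checkpointing")
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--port", type=int, default=9999)
    return parser.parse_args(argv)


def create_context(args):
    """Build a StreamingContext that only prints a per-batch line count."""
    # Imported here so the flag parsing can be exercised without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="checkpoint-heap-check")  # hypothetical app name
    ssc = StreamingContext(sc, 10)  # 10-second batches, an arbitrary choice
    lines = ssc.socketTextStream(args.host, args.port)
    lines.count().pprint()
    if args.checkpoint:
        ssc.checkpoint(args.checkpoint)  # the only difference between the two runs
    return ssc


def run(argv):
    """Start the stream, with or without checkpointing, and block until stopped."""
    from pyspark.streaming import StreamingContext

    args = parse_args(argv)
    if args.checkpoint:
        # Recover from an existing checkpoint, or build a fresh context.
        ssc = StreamingContext.getOrCreate(args.checkpoint,
                                           lambda: create_context(args))
    else:
        ssc = create_context(args)
    ssc.start()
    ssc.awaitTermination()
```

To use it, add a call such as `run(sys.argv[1:])` (with `import sys`) at the bottom and submit the file once with `--checkpoint <dir>` and once without, watching the driver heap in both cases.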

      Attachments

        1. bug.py
          2 kB
          Antony Mayi
        2. finalizer-classes.png
          3 kB
          Antony Mayi
        3. finalizer-pending.png
          2 kB
          Antony Mayi
        4. finalizer-spark_assembly.png
          9 kB
          Antony Mayi

            People

              Assignee: Shixiong Zhu (zsxwing)
              Reporter: Antony Mayi (antonymayi)
              Votes: 0
              Watchers: 9
