SPARK-12511

streaming driver with checkpointing unable to finalize leading to OOM

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.2, 1.6.0
    • Fix Version/s: 1.6.1, 2.0.0
    • Component/s: DStreams, PySpark
    • Labels:
      None
    • Environment:

      pyspark 1.5.2
      yarn 2.6.0
      python 2.6
      centos 6.5
      openjdk 1.8.0

      Description

      A Spark Streaming application configured with checkpointing fills the driver's heap with ZipFileInputStream instances, apparently because spark-assembly.jar (and possibly other jars, for example snappy-java.jar) is repeatedly referenced (loaded?). The Java Finalizer can't finalize these ZipFileInputStream instances, so they eventually take up the whole heap and the driver crashes with an OOM error.

      Steps to reproduce:

      • Submit the attached bug.py to Spark
      • Leave it running and monitor the heap of the driver's Java process
        • In a heap dump you will mainly see a growing number of byte array instances (here the accumulated zip payload of the jar references):
           num     #instances         #bytes  class name
          ----------------------------------------------
             1:         32653       32735296  [B
             2:         48000        5135816  [C
             3:            41        1344144  [Lscala.concurrent.forkjoin.ForkJoinTask;
             4:         11362        1261816  java.lang.Class
             5:         47054        1129296  java.lang.String
             6:         25460        1018400  java.lang.ref.Finalizer
             7:          9802         789400  [Ljava.lang.Object;
          
        • With VisualVM you can see:
          • an increasing number of objects pending finalization
          • an increasing number of ZipFileInputStream instances related to spark-assembly.jar and referenced by the Finalizer
      • Depending on the heap size and the running time, this leads to a driver OOM crash
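
The monitoring step above can be partly automated. The following is a minimal sketch, not part of the original report, that parses a class histogram in the format shown above and reports the number of java.lang.ref.Finalizer instances. It assumes the text comes from something like `jmap -histo <driver pid>`; the function names `parse_histo` and `finalizer_backlog` are hypothetical, and the sample rows are copied from the histogram in this report.

```python
import re

# One data row of the histogram: rank, instance count, byte count, class name.
ROW = re.compile(r"^\s*(\d+):\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histo(text):
    """Return {class_name: (instances, bytes)} for every histogram row."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # header and separator lines do not match and are skipped
            rows[match.group(4)] = (int(match.group(2)), int(match.group(3)))
    return rows


def finalizer_backlog(text):
    """Number of java.lang.ref.Finalizer instances (0 if the class is absent)."""
    return parse_histo(text).get("java.lang.ref.Finalizer", (0, 0))[0]


# Two rows taken from the histogram in this report.
SAMPLE = """\
 num     #instances         #bytes  class name
----------------------------------------------
   1:         32653       32735296  [B
   6:         25460        1018400  java.lang.ref.Finalizer
"""

if __name__ == "__main__":
    print(finalizer_backlog(SAMPLE))
```

Running this periodically against fresh histogram output and comparing the counts would show whether the Finalizer backlog keeps growing, which is the symptom described here.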

      Comments

      • bug.py is a lightweight proof of the problem. In production the effect is quite rapid: within a few hours it eats gigabytes of heap and kills the app.
      • If the same bug.py is run without checkpointing, there is no issue whatsoever.
      • Not sure whether this is specific to PySpark.
      • bug.py uses the socketTextStream input, but the problem seems to be independent of the input type (in production I have the same problem with a Kafka direct stream, and I have seen it even with textFileStream).
      • It happens even if the input stream doesn't produce any data.
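
The attached bug.py is not reproduced on this page. As a rough illustration of the setup described above (a checkpointed PySpark streaming job reading from socketTextStream), a script along these lines might be used. This is a sketch under assumptions, not the reporter's script: the app name, the batch interval, the output operation and the command-line arguments are all hypothetical.

```python
import sys


def parse_args(argv):
    """Return (checkpoint_dir, host, port) from the script's argv."""
    checkpoint_dir, host, port = argv[1:4]
    return checkpoint_dir, host, int(port)


def create_context(checkpoint_dir, host, port, batch_seconds=10):
    # Imported lazily so the module can be loaded without pyspark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="checkpoint-finalizer-repro")  # hypothetical name
    ssc = StreamingContext(sc, batch_seconds)  # batch interval is an assumption
    # Per the report, the heap growth is only seen with checkpointing enabled.
    ssc.checkpoint(checkpoint_dir)
    lines = ssc.socketTextStream(host, port)
    # Any output operation will do; the stream may stay empty.
    lines.count().pprint()
    return ssc


def main(argv):
    from pyspark.streaming import StreamingContext

    checkpoint_dir, host, port = parse_args(argv)
    # Recover from the checkpoint if one exists, otherwise build a new context.
    ssc = StreamingContext.getOrCreate(
        checkpoint_dir, lambda: create_context(checkpoint_dir, host, port))
    ssc.start()
    ssc.awaitTermination()


# Only start streaming when all three arguments are supplied.
if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv)
```

It would be submitted with the checkpoint directory, host and port as arguments. Removing the `ssc.checkpoint(...)` call and building the context directly gives the no-checkpointing variant that the reporter says shows no issue.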

        Attachments

        1. finalizer-classes.png
          3 kB
          Antony Mayi
        2. finalizer-pending.png
          2 kB
          Antony Mayi
        3. finalizer-spark_assembly.png
          9 kB
          Antony Mayi
        4. bug.py
          2 kB
          Antony Mayi

              People

              • Assignee:
                zsxwing Shixiong Zhu
                Reporter:
                antonymayi Antony Mayi
              • Votes:
                0
                Watchers:
                9
