SPARK-12511

streaming driver with checkpointing unable to finalize leading to OOM


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.5.2, 1.6.0
    • Fix Version/s: 1.6.1, 2.0.0
    • Component/s: DStreams, PySpark
    • Labels: None
    • Environment:
      pyspark 1.5.2
      yarn 2.6.0
      python 2.6
      centos 6.5
      openjdk 1.8.0

    Description

      A Spark Streaming application configured with checkpointing fills the driver's heap with ZipFileInputStream instances, created as spark-assembly.jar (and potentially other jars, e.g. snappy-java.jar) is repeatedly referenced (loaded?). The Java Finalizer cannot finalize these ZipFileInputStream instances fast enough; they eventually consume the whole heap, leading the driver to an OOM crash.
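
      The attached bug.py is not inlined in this report. As a minimal sketch of the kind of checkpointed PySpark Streaming application described above (checkpoint directory, host, and port are illustrative assumptions, not values taken from bug.py):

        # Illustrative sketch only; the actual reproduction is the attached bug.py.
        from pyspark import SparkContext
        from pyspark.streaming import StreamingContext

        CHECKPOINT_DIR = "hdfs:///tmp/bug-checkpoint"  # assumed location

        def create_context():
            sc = SparkContext(appName="checkpoint-oom-repro")
            ssc = StreamingContext(sc, batchDuration=5)
            ssc.checkpoint(CHECKPOINT_DIR)  # checkpointing is what triggers the reported leak
            lines = ssc.socketTextStream("localhost", 9999)  # assumed source
            lines.count().pprint()  # trivial output op; the leak shows up even with no input data
            return ssc

        ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
        ssc.start()
        ssc.awaitTermination()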

      Steps to reproduce:

      • Submit the attached bug.py to Spark
      • Leave it running and monitor the driver java process heap (a polling sketch follows this list)
        • in a heap dump you will primarily see a growing number of byte-array instances (the accumulated zip payload of the jar references):
           num     #instances         #bytes  class name
          ----------------------------------------------
             1:         32653       32735296  [B
             2:         48000        5135816  [C
             3:            41        1344144  [Lscala.concurrent.forkjoin.ForkJoinTask;
             4:         11362        1261816  java.lang.Class
             5:         47054        1129296  java.lang.String
             6:         25460        1018400  java.lang.ref.Finalizer
             7:          9802         789400  [Ljava.lang.Object;
          
        • with VisualVM you can see:
          • an increasing number of objects pending finalization
          • an increasing number of ZipFileInputStream instances related to spark-assembly.jar, referenced by the Finalizer
      • Depending on the heap size and running time, this leads to a driver OOM crash
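
      One way to automate the heap-monitoring step above is to poll jmap -histo on the driver JVM and print only the rows for the implicated classes. A small illustrative helper (the PID is an assumption and must be replaced):

        # Illustrative monitor; jmap ships with the JDK.
        import subprocess
        import time

        DRIVER_PID = "12345"  # assumption: replace with the actual driver JVM pid
        SUSPECTS = set(["[B", "java.lang.ref.Finalizer",
                        "java.util.zip.ZipFile$ZipFileInputStream"])

        while True:
            proc = subprocess.Popen(["jmap", "-histo", DRIVER_PID],
                                    stdout=subprocess.PIPE)
            histo = proc.communicate()[0].decode()
            for line in histo.splitlines():
                cols = line.split()
                # histogram rows look like: num  #instances  #bytes  class name
                if cols and cols[-1] in SUSPECTS:
                    print(line.rstrip())
            print("-" * 60)
            time.sleep(60)  # sample once a minute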

      Comments

      • The attached bug.py is a lightweight proof of the problem. In production the effect is quite rapid - within a few hours it eats gigabytes of heap and kills the application.
      • If the same bug.py is run without checkpointing there is no issue whatsoever (a control-run sketch follows this list).
      • Not sure whether it is PySpark-specific.
      • In bug.py I am using the socketTextStream input, but the problem seems independent of the input type (in production I see the same problem with the Kafka direct stream, and I have seen it even with textFileStream).
      • It happens even if the input stream doesn't produce any data.
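
      For the control run mentioned above (checkpointing disabled, no leak observed), the sketch reduces to a plain streaming context; again illustrative, with assumed host and port:

        # Illustrative control run: the same app with checkpointing disabled.
        from pyspark import SparkContext
        from pyspark.streaming import StreamingContext

        sc = SparkContext(appName="checkpoint-oom-control")
        ssc = StreamingContext(sc, batchDuration=5)  # note: no ssc.checkpoint(...) call
        ssc.socketTextStream("localhost", 9999).count().pprint()
        ssc.start()
        ssc.awaitTermination()  # per the report, the heap stays flat in this mode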

      Attachments

        1. bug.py
          2 kB
          Antony Mayi
        2. finalizer-spark_assembly.png
          9 kB
          Antony Mayi
        3. finalizer-pending.png
          2 kB
          Antony Mayi
        4. finalizer-classes.png
          3 kB
          Antony Mayi


            People

              Assignee: Shixiong Zhu (zsxwing)
              Reporter: Antony Mayi (antonymayi)
              Votes: 0
              Watchers: 9
