Details
Type: Bug
Status: In Progress
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.4.3, 3.0.0
Fix Version/s: None
Component/s: None
Description
This affects at least Spark on K8s: the pods keep running forever, which makes it impossible for tools like the Spark Operator to report the job status back.
A SparkPi job is running:
spark-pi-driver 1/1 Running 0 1h
spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
spark-pi2-1559309337787-exec-2 1/1 Running 0 1h
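As an illustration of the reporting problem, one can ask the operator directly what it thinks the job is doing; the jsonpath below assumes the spark-on-k8s-operator's status schema:
$ kubectl get sparkapplication spark-pi2 -n spark -o yaml
$ kubectl get sparkapplication spark-pi2 -n spark -o jsonpath='{.status.applicationState.state}'
Since the driver pod never terminates, the reported state can stay RUNNING indefinitely even after the driver has effectively died.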
with the following setup:
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: "skonto/spark:k8s-3.0.0-sa"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
  arguments:
    - "1000000"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never
  nodeSelector:
    "spark": "autotune"
  driver:
    memory: "1g"
    labels:
      version: 2.4.0
    serviceAccount: spark-sa
  executor:
    instances: 2
    memory: "1g"
    labels:
      version: 2.4.0
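The spec is submitted through the operator in the usual way (the file name here is just illustrative):
$ kubectl apply -f spark-pi.yaml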
At some point the driver fails (the DAG scheduler thread dies with an OutOfMemoryError), but the driver process keeps running, and so do the pods:
19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KiB, free 110.0 MiB)
19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1765.0 B, free 110.0 MiB)
19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 MiB)
19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1180
19/05/31 13:29:25 INFO DAGScheduler: Submitting 1000000 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 1000000 tasks
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
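A possible stop-gap at the JVM level (only a sketch, not a fix for the underlying behaviour) is to let the driver JVM exit as soon as an OutOfMemoryError is thrown, so the container terminates and Kubernetes and the operator can observe the failure. This assumes a JDK with the ExitOnOutOfMemoryError flag (8u92+), passed through the standard spark.driver.extraJavaOptions setting via the SparkApplication's sparkConf map:
spec:
  sparkConf:
    "spark.driver.extraJavaOptions": "-XX:+ExitOnOutOfMemoryError"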
Meanwhile top shows plenty of free memory:
Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached
$ kubectl describe pod spark-pi2-driver -n spark
Name: spark-pi2-driver
Namespace: spark
Priority: 0
PriorityClassName: <none>
Node: gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
Start Time: Fri, 31 May 2019 16:28:59 +0300
Labels: spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
spark-role=driver
sparkoperator.k8s.io/app-name=spark-pi2
sparkoperator.k8s.io/launched-by-spark-operator=true
sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
version=2.4.0
Annotations: <none>
Status: Running
IP: 10.12.103.4
Controlled By: SparkApplication/spark-pi2
Containers:
  spark-kubernetes-driver:
    Container ID:  docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
    Image:         skonto/spark:k8s-3.0.0-sa
    Image ID:      docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      org.apache.spark.examples.SparkPi
      spark-internal
      1000000
    State:          Running
Inside the container, the processes are in interruptible sleep:
PID PPID USER STAT VSZ %VSZ CPU %CPU COMMAND
15 1 185 S 2114m 7% 0 0% /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
287 0 185 S 2344 0% 3 0% sh
294 287 185 R 1536 0% 3 0% top
1 0 185 S 776 0% 0 0% /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file /opt/spark/conf/spark.prope
Liveness checks might be a workaround, but the REST APIs may still be working if other threads in the JVM are still running, as in this case (I checked the Spark UI and it was still up).
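For reference, a liveness check of the kind mentioned above could be a plain Kubernetes livenessProbe on the driver container, e.g. probing the UI port 4040; this is only a sketch, since injecting it into the operator-generated driver pod would need a pod template or webhook, and, as noted, the UI may keep answering even after the scheduler thread has died:
livenessProbe:
  tcpSocket:
    port: 4040
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3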