Spark / SPARK-27900

Spark driver does not exit after an OOM error


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.3, 3.0.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      This affects at least Spark on K8s: the pods keep running forever, which makes it impossible for tools like Spark Operator to report back the job status.

      A SparkPi job is running:

      spark-pi-driver 1/1 Running 0 1h
      spark-pi2-1559309337787-exec-1 1/1 Running 0 1h
      spark-pi2-1559309337787-exec-2 1/1 Running 0 1h

      with the following setup:

      apiVersion: "sparkoperator.k8s.io/v1beta1"
      kind: SparkApplication
      metadata:
        name: spark-pi
        namespace: spark
      spec:
        type: Scala
        mode: cluster
        image: "skonto/spark:k8s-3.0.0-sa"
        imagePullPolicy: Always
        mainClass: org.apache.spark.examples.SparkPi
        mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar"
        arguments:
          - "1000000"
        sparkVersion: "2.4.0"
        restartPolicy:
          type: Never
        nodeSelector:
          "spark": "autotune"
        driver:
          memory: "1g"
          labels:
            version: 2.4.0
          serviceAccount: spark-sa
        executor:
          instances: 2
          memory: "1g"
          labels:
            version: 2.4.0

      At some point the driver fails with an OutOfMemoryError, but its JVM keeps running, so the pods stay in the Running state:

      19/05/31 13:29:20 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
      19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.0 KiB, free 110.0 MiB)
      19/05/31 13:29:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1765.0 B, free 110.0 MiB)
      19/05/31 13:29:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-pi2-1559309337787-driver-svc.spark.svc:7079 (size: 1765.0 B, free: 110.0 MiB)
      19/05/31 13:29:23 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1180
      19/05/31 13:29:25 INFO DAGScheduler: Submitting 1000000 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
      19/05/31 13:29:25 INFO TaskSchedulerImpl: Adding task set 0.0 with 1000000 tasks
      Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
      at scala.collection.mutable.ResizableArray.ensureSize(ResizableArray.scala:106)
      at scala.collection.mutable.ResizableArray.ensureSize$(ResizableArray.scala:96)
      at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:49)
      Mem: 2295260K used, 24458144K free, 1636K shrd, 48052K buff, 899424K cached
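The log above is consistent with general JVM behavior: an uncaught error terminates only the thread that threw it, and the process exits only once all non-daemon threads have finished or something calls exit explicitly. The following is a minimal Java sketch of that behavior, not Spark code. The class name `OomThreadDemo`, the method name, and the simulated `OutOfMemoryError` are assumptions made for illustration; only the thread name is taken from the log.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it does not reproduce Spark's actual thread layout.
final class OomThreadDemo {

    /**
     * Starts a worker that dies from a simulated OutOfMemoryError and reports
     * whether the calling thread observed the death and kept running.
     */
    public static boolean callerSurvivesWorkerOom() {
        CountDownLatch handlerRan = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            // Simulated: a real OOM would come from a failed allocation.
            throw new OutOfMemoryError("Java heap space (simulated)");
        }, "dag-scheduler-event-loop");

        // Runs on the dying worker thread just before it terminates.
        worker.setUncaughtExceptionHandler((thread, error) -> {
            System.err.println("Exception in thread \"" + thread.getName() + "\" " + error);
            handlerRan.countDown();
        });

        worker.start();
        try {
            boolean observed = handlerRan.await(5, TimeUnit.SECONDS);
            worker.join(5000);
            // Reaching this line shows the JVM outlived the failed thread.
            return observed && !worker.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("worker died, caller still running: " + callerSurvivesWorkerOom());
    }
}
```

If the same holds for the driver, the pod's main container process never returns an exit code, which would match the Running status reported here.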

      $ kubectl describe pod spark-pi2-driver -n spark
      Name:               spark-pi2-driver
      Namespace:          spark
      Priority:           0
      PriorityClassName:  <none>
      Node:               gke-test-cluster-1-spark-autotune-46c36f4f-x3z9/10.138.0.44
      Start Time:         Fri, 31 May 2019 16:28:59 +0300
      Labels:             spark-app-selector=spark-74d8e5a8f1af428d91093dfa6ee9d661
                          spark-role=driver
                          sparkoperator.k8s.io/app-name=spark-pi2
                          sparkoperator.k8s.io/launched-by-spark-operator=true
                          sparkoperator.k8s.io/submission-id=spark-pi2-1559309336226927526
                          version=2.4.0
      Annotations:        <none>
      Status:             Running
      IP:                 10.12.103.4
      Controlled By:      SparkApplication/spark-pi2
      Containers:
        spark-kubernetes-driver:
          Container ID:  docker://55dadb603290b42f9ddb71959edf0224ddc7ea621ee15429941d3bcc7db9b71f
          Image:         skonto/spark:k8s-3.0.0-sa
          Image ID:      docker-pullable://skonto/spark@sha256:6268d760d1a006b69c7086f946e4d5d9a3b99f149832c63cfc7fe39671f5cda9
          Ports:         7078/TCP, 7079/TCP, 4040/TCP
          Host Ports:    0/TCP, 0/TCP, 0/TCP
          Args:
            driver
            --properties-file
            /opt/spark/conf/spark.properties
            --class
            org.apache.spark.examples.SparkPi
            spark-internal
            1000000
          State:         Running

      In the container, the processes are in interruptible sleep:

      PID   PPID  USER  STAT  VSZ    %VSZ  CPU  %CPU  COMMAND
      15    1     185   S     2114m  7%    0    0%    /usr/lib/jvm/java-1.8-openjdk/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx500m org.apache.spark.deploy.SparkSubmit --deploy-mode client --conf spar
      287   0     185   S     2344   0%    3    0%    sh
      294   287   185   R     1536   0%    3    0%    top
      1     0     185   S     776    0%    0    0%    /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.12.103.4 --deploy-mode client --properties-file /opt/spark/conf/spark.prope

      Liveness checks might be a workaround, but the REST APIs may still respond if other threads in the JVM are still running, as in this case (I checked the Spark UI and it was still available).
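Because a liveness check can pass while the scheduler thread is dead, one alternative is to make the process itself terminate on an out-of-memory error. The sketch below shows one possible approach in plain Java: a default uncaught-exception handler that halts the JVM with a non-zero exit code. It is not the fix adopted for this issue, and the class `ExitOnOom`, its methods, and the exit code value are hypothetical names chosen for illustration. The exit action is injected so the logic can be exercised without killing the JVM.

```java
import java.util.function.IntConsumer;

// Hypothetical helper; not part of Spark.
final class ExitOnOom {

    // Arbitrary non-zero exit code, chosen only for illustration.
    public static final int OOM_EXIT_CODE = 52;

    /** Builds a handler that calls the exit action only for OutOfMemoryError. */
    public static Thread.UncaughtExceptionHandler handler(IntConsumer exitAction) {
        return (thread, error) -> {
            System.err.println("Uncaught in thread \"" + thread.getName() + "\": " + error);
            if (error instanceof OutOfMemoryError) {
                exitAction.accept(OOM_EXIT_CODE);
            }
        };
    }

    /** Installs the handler for all threads; halt() skips shutdown hooks, which may not run reliably after an OOM. */
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler(handler(Runtime.getRuntime()::halt));
    }

    public static void main(String[] args) {
        // Dry run: report the exit code instead of halting the JVM.
        handler(code -> System.out.println("would halt with exit code " + code))
            .uncaughtException(Thread.currentThread(), new OutOfMemoryError("simulated"));
    }
}
```

With a handler like this installed early in the driver, the container would exit with a non-zero code, and with restartPolicy Never the pod should leave the Running state. HotSpot also provides the `-XX:+ExitOnOutOfMemoryError` flag (JDK 8u92 and later), which has a similar effect without code changes; whether it works for this particular image and setup has not been verified here.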

       

       


          People

            Assignee: Unassigned
            Reporter: Stavros Kontopoulos (skonto)