Spark / SPARK-34645

[K8S] Driver pod stuck in Running state after job completes

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.2
    • Fix Version/s: None
    • Component/s: Kubernetes
    • Labels: None

    Description

      I am running automated benchmarks in k8s, using spark-submit in cluster mode, so the driver runs in a pod.

      When running with Spark 3.0.1 and 3.1.1 everything works as expected and I see the Spark context being shut down after the job completes.

      However, when running with Spark 3.0.2 I do not see the context get shut down and the driver pod is stuck in the Running state indefinitely.

      The following output appears after job completion with 3.0.1 and 3.1.1. With 3.0.2, none of it appears; there is no output at all after the job completes.

      2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from shutdown hook
      2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
      2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
      2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting down all executors
      2021-03-05 20:09:24,600 INFO k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
      2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
      2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
      2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
      2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
      2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
      2021-03-05 20:09:24,752 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
      2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped SparkContext
      2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
      2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
      2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
      2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
      2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
      2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
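
The first line of the log above shows that SparkContext.stop() is invoked from a JVM shutdown hook. The JVM runs shutdown hooks only once it starts exiting, which normally happens when the last non-daemon thread ends or System.exit is called. This report does not identify the cause of the 3.0.2 behavior, so the following is only a diagnostic sketch under one unconfirmed assumption: that some non-daemon thread is still alive in the driver and keeps the JVM from exiting. The class name NonDaemonThreads and the thread name hypothetical-leaked-worker are made up for illustration; only standard JDK calls are used.

```java
import java.util.ArrayList;
import java.util.List;

class NonDaemonThreads {

    // Names of live non-daemon threads, excluding the calling thread.
    // Any thread listed here can keep the JVM (and so the driver pod) running.
    static List<String> liveNonDaemonThreads() {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t != Thread.currentThread() && t.isAlive() && !t.isDaemon()) {
                names.add(t.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for a thread left running by user code or a library.
        Thread leaked = new Thread(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException e) {
                // Interrupted: fall through and let the thread end.
            }
        }, "hypothetical-leaked-worker");
        leaked.start();

        System.out.println("Non-daemon threads still alive: " + liveNonDaemonThreads());

        // Stop the stand-in thread so this demo itself can exit.
        leaked.interrupt();
        leaked.join();
    }
}
```

Calling a helper like liveNonDaemonThreads() at the end of the application's main method, or inspecting a thread dump of the stuck driver for threads not marked as daemon, would show whether this assumption applies to a given run.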
       

      Attachments

        1. dump.txt
          11 kB
          Michael Negodaev

            People

              Assignee: Unassigned
              Reporter: Andy Grove (andygrove)
              Votes: 0
              Watchers: 9
