Spark / SPARK-29443

Spark-on-Kubernetes: driver pod keeps spinning up/terminating executor pods


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.4.4
    • Fix Version/s: None
    • Environment:

      EKS version: v1.14.6-eks-5047ed with 18 x r5a.12xlarge EC2 instances as Worker nodes
      Spark version: 2.4.4
      Kubernetes client version: 4.4.2

      Description

      We are running Spark on Kubernetes (EKS). For some Spark jobs, the driver pod keeps creating and terminating executor pods; each executor pod only lives for a few seconds. Some jobs have been stuck like this for days and have spun up thousands of executor pods.

      ➜ kubectl get pods --field-selector=status.phase=Running -n spark
      NAME READY STATUS RESTARTS AGE
      log-parser-k8s.structured-log.growth-referral-late-1570797184638 1/1 Running 0 111m
      log-parser-k8s.structured-log.growth-referral-late-1570798665864 1/1 Running 0 87m
      log-parser-k8s.structured-log.growth-referral-late-1570801912533 1/1 Running 0 33m
      log-parser-k8s.structured-log.growth-referral-late-1570803823618 1/1 Running 0 77s
      log-parser-k8s.structured-log.orbital-latest-struc-1570797051892 1/1 Running 0 114m
      log-parser-k8s.structured-log.orbital-latest-struc-1570803823552 1/1 Running 0 77s
      logparserstructured-1570797185983-exec-1727 1/1 Running 0 4s
      logparserstructured-1570801913787-exec-431 1/1 Running 0 4s
      logparserstructured-1570803824926-exec-1 1/1 Running 0 69s
      logparserstructured-1570803824926-exec-2 1/1 Running 0 69s
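
      For reference, the executor pod names in the listing above appear to follow the pattern <sanitized-app-name>-<launch-time-millis>-exec-<N>, where N comes from a per-application counter. A minimal Scala sketch of that naming scheme (illustrative only, not the actual Spark source; the prefix value is copied from the listing):

      import java.util.concurrent.atomic.AtomicLong

      // Illustrative sketch of the executor pod naming visible above:
      // a per-application prefix plus a monotonically increasing executor ID.
      object ExecutorPodNames {
        // Assumed prefix, copied from the listing: sanitized app name + launch time (epoch millis)
        private val podNamePrefix = "logparserstructured-1570803824926"
        private val executorIdCounter = new AtomicLong(0)

        def nextPodName(): String =
          s"$podNamePrefix-exec-${executorIdCounter.incrementAndGet()}"
      }

      // The first two requests yield:
      //   logparserstructured-1570803824926-exec-1
      //   logparserstructured-1570803824926-exec-2
      // If the driver asks the API server to create a name that already exists,
      // Kubernetes rejects the POST with 409 AlreadyExists, which is the error
      // shown in the driver log further down.

      Because the counter only ever increases within one application, jobs that keep churning executors end up at values like exec-1727 in the listing above.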
      

      Running kubectl get pods again, we can see a bunch of executor pods being terminated:

      logparserstructured-1570786196291-exec-1511 0/1 Terminating 0 7s
      logparserstructured-1570786196291-exec-1513 0/1 Terminating 0 6s
      logparserstructured-1570786196291-exec-1516 0/1 Terminating 0 5s
      logparserstructured-1570786196291-exec-1518 0/1 Terminating 0 4s
      logparserstructured-1570786196291-exec-1519 0/1 Terminating 0 3s
      logparserstructured-1570786196291-exec-1520 0/1 Terminating 0 3s
      logparserstructured-1570786196291-exec-1521 0/1 Terminating 0 2s
      logparserstructured-1570786196291-exec-1522 0/1 Terminating 0 2s
      logparserstructured-1570786196291-exec-1523 0/1 Terminating 0 1s
      logparserstructured-1570786196291-exec-1524 0/1 Terminating 0 1s
      logparserstructured-1570786469589-exec-1487 0/1 Terminating 0 7s
      logparserstructured-1570786469589-exec-1489 0/1 Terminating 0 6s
      logparserstructured-1570786469589-exec-1491 0/1 Terminating 0 5s
      logparserstructured-1570786469589-exec-1493 0/1 Terminating 0 4s
      logparserstructured-1570786469589-exec-1494 0/1 Terminating 0 3s
      logparserstructured-1570786469589-exec-1495 0/1 Terminating 0 2s
      logparserstructured-1570786469589-exec-1496 0/1 Terminating 0 2s
      logparserstructured-1570786469589-exec-1497 0/1 Terminating 0 1s
      logparserstructured-1570786469589-exec-1498 0/1 Terminating 0 1s
      logparserstructured-1570786469589-exec-1499 0/1 Terminating 0 0s
      logparserstructured-1570786469589-exec-1500 0/1 Terminating 0 0s
      logparserstructured-1570787682660-exec-1293 0/1 Terminating 0 8s
      logparserstructured-1570787682660-exec-1299 0/1 Terminating 0 5s
      logparserstructured-1570787682660-exec-1300 0/1 Terminating 0 5s
      logparserstructured-1570787682660-exec-1301 0/1 Terminating 0 4s
      logparserstructured-1570787682660-exec-1303 0/1 Terminating 0 3s
      logparserstructured-1570787682660-exec-1305 0/1 Terminating 0 2s
      logparserstructured-1570787682660-exec-1306 0/1 Terminating 0 2s
      logparserstructured-1570787682660-exec-1307 0/1 Terminating 0 1s
      logparserstructured-1570787682660-exec-1308 0/1 Terminating 0 1s
      

      Checking the driver pod logs, we see the Kubernetes client failing to create executor pods with a 409 AlreadyExists error:

      19/10/11 09:54:51 INFO Utils: Successfully started service 'SparkUI' on port 4040.
      19/10/11 09:54:51 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://logparserstructured-1570787682660-driver-svc.spark.svc:4040
      19/10/11 09:54:51 INFO SparkContext: Added JAR s3://datastore/source-code/log-analytics-assembly-2.4.0.jar at s3://datastore/source-code/log-analytics-assembly-2.4.0.jar with timestamp 1570787691192
      19/10/11 09:54:52 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
      19/10/11 09:54:52 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
      19/10/11 09:54:52 INFO NettyBlockTransferService: Server created on logparserstructured-1570787682660-driver-svc.spark.svc:7079
      19/10/11 09:54:52 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
      19/10/11 09:54:52 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
      19/10/11 09:54:52 INFO BlockManagerMasterEndpoint: Registering block manager logparserstructured-1570787682660-driver-svc.spark.svc:7079 with 2004.6 MB RAM, BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
      19/10/11 09:54:52 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
      19/10/11 09:54:52 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, logparserstructured-1570787682660-driver-svc.spark.svc, 7079, None)
      19/10/11 09:54:52 ERROR Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1
      io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/spark/pods. Message: pods "logparserstructured-1570787682660-exec-1" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=pods, name=logparserstructured-1570787682660-exec-1, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "logparserstructured-1570787682660-exec-1" already exists, metadata=ListMeta(_continue=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:503)
       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:442)
       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:406)
       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
       at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:234)
       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:796)
       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:326)
       at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:322)
       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$org$apache$spark$scheduler$cluster$k8s$ExecutorPodsAllocator$$onNewSnapshots$1.apply$mcVI$sp(ExecutorPodsAllocator.scala:139)
       at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsAllocator$$onNewSnapshots(ExecutorPodsAllocator.scala:126)
       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$start$1.apply(ExecutorPodsAllocator.scala:68)
       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$start$1.apply(ExecutorPodsAllocator.scala:68)
       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anonfun$org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber$1.apply$mcV$sp(ExecutorPodsSnapshotsStoreImpl.scala:102)
       at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$$callSubscriber(ExecutorPodsSnapshotsStoreImpl.scala:99)
       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anonfun$addSubscriber$1.apply$mcV$sp(ExecutorPodsSnapshotsStoreImpl.scala:71)
       at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$$anon$1.run(ExecutorPodsSnapshotsStoreImpl.scala:107)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
      
      19/10/11 09:55:37 INFO BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
      19/10/11 09:55:37 INFO BlockManagerMaster: Removal of executor 8 requested
      19/10/11 09:55:37 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 8
      19/10/11 09:55:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
      19/10/11 09:55:58 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
      19/10/11 09:56:00 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
      19/10/11 09:56:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 9 from BlockManagerMaster.
      19/10/11 09:56:02 INFO BlockManagerMaster: Removal of executor 9 requested
      19/10/11 09:56:02 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 9
      19/10/11 09:56:02 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
      19/10/11 09:56:04 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
      19/10/11 09:56:05 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
      19/10/11 09:56:05 INFO BlockManagerMaster: Removal of executor 10 requested
      19/10/11 09:56:05 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 10
      19/10/11 09:56:06 INFO BlockManagerMaster: Removal of executor 11 requested
      19/10/11 09:56:06 INFO BlockManagerMasterEndpoint: Trying to remove executor 11 from BlockManagerMaster.
      19/10/11 09:56:06 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 11
      19/10/11 09:56:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
      19/10/11 09:56:22 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
      19/10/11 09:56:24 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
      19/10/11 09:56:25 INFO BlockManagerMasterEndpoint: Trying to remove executor 12 from BlockManagerMaster.
      19/10/11 09:56:25 INFO BlockManagerMaster: Removal of executor 12 requested
      19/10/11 09:56:25 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 12
      19/10/11 09:56:26 INFO BlockManagerMasterEndpoint: Trying to remove executor 13 from BlockManagerMaster.
      19/10/11 09:56:26 INFO BlockManagerMaster: Removal of executor 13 requested
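
      For context, the AlreadyExists failure above is raised by the fabric8 Kubernetes client as a KubernetesClientException carrying HTTP status 409. A minimal sketch, assuming kubernetes-client 4.4.2 as listed in the environment, of creating an executor-style pod and distinguishing that case (the pod name is taken from the log; the namespace and image are placeholders):

      import io.fabric8.kubernetes.api.model.PodBuilder
      import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException}

      object CreateExecutorPodSketch {
        def main(args: Array[String]): Unit = {
          // Auto-configures from the in-cluster service account, as the driver pod does.
          val client = new DefaultKubernetesClient()
          val pod = new PodBuilder()
            .withNewMetadata()
              .withName("logparserstructured-1570787682660-exec-1") // name from the driver log above
              .withNamespace("spark")
            .endMetadata()
            .withNewSpec()
              .addNewContainer()
                .withName("executor")
                .withImage("spark-executor:2.4.4") // placeholder image
              .endContainer()
            .endSpec()
            .build()

          try {
            client.pods().inNamespace("spark").create(pod)
          } catch {
            case e: KubernetesClientException if e.getCode == 409 =>
              // Same condition as in the driver log: the API server already holds a pod
              // with this name, so the POST is rejected with AlreadyExists.
              println(s"Pod already exists: ${e.getStatus.getMessage}")
          } finally {
            client.close()
          }
        }
      }

      The stack trace above shows the exception being caught by Utils.tryLogNonFatalError, so the allocator keeps running and keeps requesting executors, which matches the repeated "Going to request 1 executors from Kubernetes." lines.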
      


            People

            • Assignee: Unassigned
            • Reporter: anhtu.nguyen (James Nguyen)
            • Votes: 0
            • Watchers: 3
