SPARK-32975: Add config for driver readiness timeout before executors start


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.4, 3.0.2, 3.1.2, 3.2.0
    • Fix Version/s: 3.2.0, 3.1.3
    • Component/s: Kubernetes
    • Labels: None

    Description

      We are using the v1beta2-1.1.2-2.4.5 version of the Spark operator with spark-2.4.4.

      Spark executors keep getting killed with exit code 1, and we see the following exception in the executor, which then goes into an error state. Once this error happens, the driver doesn't restart the executor.

       

      Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
      at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
      at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
      at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
      at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
      Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
      at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
      at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
      at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
      at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
      at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
      at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
      ... 4 more
      Caused by: java.io.IOException: Failed to connect to act-pipeline-app-1600187491917-driver-svc.default.svc:7078
      at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
      at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
      at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
      at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
      at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      Caused by: java.net.UnknownHostException: act-pipeline-app-1600187491917-driver-svc.default.svc
      at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
      at java.net.InetAddress.getAllByName(InetAddress.java:1193)
      at java.net.InetAddress.getAllByName(InetAddress.java:1127)
      at java.net.InetAddress.getByName(InetAddress.java:1077)
      at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
      at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
      at java.security.AccessController.doPrivileged(Native Method)
      at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
      at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
      at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
      at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
      at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
      at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
      at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
      at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
      at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
      at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
      at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
      at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
      at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
      at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
      at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
      at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
      at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
      at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
      at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
      at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
      at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
      at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
      at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
      at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
      at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
      ... 1 more
      CodeCache: size=245760Kb used=4762Kb max_used=4763Kb free=240997Kb
      bounds [0x00007f49f5000000, 0x00007f49f54b0000, 0x00007f4a04000000]
      total_blobs=1764 nmethods=1356 adapters=324
      compilation: enabled
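      The root cause at the bottom of the trace is the UnknownHostException: the executor cannot yet resolve the driver's headless service name, so the RPC connection to the driver fails before registration. A standalone sketch (hypothetical, not Spark code) of that resolution step, using the hostname taken from this report:

      import java.net.{InetAddress, UnknownHostException}

      object ResolveDriverService {
        def main(args: Array[String]): Unit = {
          // Hostname copied from the stack trace above; the executor's RPC
          // setup needs this to resolve before it can register with the driver.
          val host = "act-pipeline-app-1600187491917-driver-svc.default.svc"
          try {
            val addr = InetAddress.getByName(host)
            println(s"resolved $host -> ${addr.getHostAddress}")
          } catch {
            case e: UnknownHostException =>
              // Same failure mode as the executor: the driver service's DNS
              // record is not published yet, so the connect attempt dies.
              println(s"not resolvable yet: ${e.getMessage}")
          }
        }
      }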


      Additional information:

      The status of the Spark application shows it is RUNNING:

      kubectl describe sparkapplications.sparkoperator.k8s.io act-pipeline-app
      ...
      ...
      Status:
        Application State:
          State:  RUNNING
        Driver Info:
          Pod Name:             act-pipeline-app-driver
          Web UI Address:       10.233.57.201:40550
          Web UI Port:          40550
          Web UI Service Name:  act-pipeline-app-ui-svc
        Execution Attempts:     1
        Executor State:
          act-pipeline-app-1600097064694-exec-1:  RUNNING
        Last Submission Attempt Time:             2020-09-14T15:24:26Z
        Spark Application Id:                     spark-942bb2e500c54f92ac357b818c712558
        Submission Attempts:                      1
        Submission ID:                            4ecdb6ca-d237-4524-b05e-c42cfcc73dc7
        Termination Time:                         <nil>
      Events:                                     <none>

       

      The executor pod is reporting that it is Terminated:

      kubectl describe pod -l sparkoperator.k8s.io/app-name=act-pipeline-app,spark-role=executor
      ...
      ...
      Containers:
        executor:
          Container ID:  docker://9aa5b585e8fb7390b87a4771f3ed1402cae41f0fe55905d0172ed6e90dde34e6
      ...
          Ports:         7079/TCP, 8090/TCP
          Host Ports:    0/TCP, 0/TCP
          Args:
            executor
          State:          Terminated
            Reason:       Error
            Exit Code:    1
            Started:      Mon, 14 Sep 2020 11:25:35 -0400
            Finished:     Mon, 14 Sep 2020 11:25:39 -0400
          Ready:          False
          Restart Count:  0
      ...
      Conditions:
        Type              Status
        Initialized       True
        Ready             False
        ContainersReady   False
        PodScheduled      True
      ...
      QoS Class:       Burstable
      Node-Selectors:  <none>
      Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                       node.kubernetes.io/unreachable:NoExecute for 300s
      Events:          <none>
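      The contrast between the operator-reported RUNNING state and the terminated container above comes down to whether the driver pod was actually ready when the executor started. One way to inspect that directly is to read the driver pod's Ready condition. The sketch below uses the fabric8 Kubernetes client (the client library Spark itself ships with); the pod and namespace names are taken from this report, everything else is illustrative:

      import io.fabric8.kubernetes.client.DefaultKubernetesClient
      import scala.collection.JavaConverters._

      object CheckDriverReady {
        def main(args: Array[String]): Unit = {
          // Uses in-cluster config or ~/.kube/config, same as kubectl.
          val client = new DefaultKubernetesClient()
          try {
            val pod = client.pods()
              .inNamespace("default")
              .withName("act-pipeline-app-driver")
              .get()
            // A pod counts as ready once its Ready condition is True; until
            // then executors may not be able to reach it through its service.
            val ready = Option(pod).exists { p =>
              p.getStatus.getConditions.asScala
                .exists(c => c.getType == "Ready" && c.getStatus == "True")
            }
            println(s"driver pod ready: $ready")
          } finally {
            client.close()
          }
        }
      }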

      In the early stage of the driver's life the failed executor is not detected (it is assumed to be running), and therefore it will not be restarted.
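      The resolution for this issue adds a driver readiness timeout so that executor pods are only requested once the driver pod is ready. A minimal sketch of wiring such a setting into an application, assuming the config key spark.kubernetes.allocation.driver.readinessTimeout introduced by this fix (confirm the exact name and default against the Spark on Kubernetes docs for 3.1.3 / 3.2.0):

      import org.apache.spark.sql.SparkSession

      object ActPipelineApp {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("act-pipeline-app")
            // Assumed config key from this issue's fix: how long the Kubernetes
            // allocator waits for the driver pod to become ready before it
            // creates executor pods. Verify the name for your Spark release.
            .config("spark.kubernetes.allocation.driver.readinessTimeout", "30s")
            .getOrCreate()

          // ... application logic ...

          spark.stop()
        }
      }

      The same setting could instead be passed at submit time with --conf rather than in application code.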

       

    People

      Assignee: Chris Wu
      Reporter: Shenson Joseph