Spark / SPARK-32371

Autodetect persistently failing executor pods and fail the application, logging the cause.


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: Kubernetes
    • Labels: None

    Description

      [root@kyok-test-1 ~]# kubectl get po -w
      NAME                                   READY   STATUS    RESTARTS   AGE
      spark-shell-a3962a736bf9e775-exec-36   1/1     Running   0          5s
      spark-shell-a3962a736bf9e775-exec-37   1/1     Running   0          3s
      spark-shell-a3962a736bf9e775-exec-36   0/1     Error     0          5s
      spark-shell-a3962a736bf9e775-exec-38   0/1     Pending   0          1s
      spark-shell-a3962a736bf9e775-exec-38   0/1     Pending   0          1s
      spark-shell-a3962a736bf9e775-exec-38   0/1     ContainerCreating   0          1s
      spark-shell-a3962a736bf9e775-exec-36   0/1     Terminating         0          6s
      spark-shell-a3962a736bf9e775-exec-36   0/1     Terminating         0          6s
      spark-shell-a3962a736bf9e775-exec-37   0/1     Error               0          5s
      spark-shell-a3962a736bf9e775-exec-38   1/1     Running             0          2s
      spark-shell-a3962a736bf9e775-exec-39   0/1     Pending             0          0s
      spark-shell-a3962a736bf9e775-exec-39   0/1     Pending             0          0s
      spark-shell-a3962a736bf9e775-exec-39   0/1     ContainerCreating   0          0s
      spark-shell-a3962a736bf9e775-exec-37   0/1     Terminating         0          6s
      spark-shell-a3962a736bf9e775-exec-37   0/1     Terminating         0          6s
      spark-shell-a3962a736bf9e775-exec-38   0/1     Error               0          4s
      spark-shell-a3962a736bf9e775-exec-39   1/1     Running             0          1s
      spark-shell-a3962a736bf9e775-exec-40   0/1     Pending             0          0s
      spark-shell-a3962a736bf9e775-exec-40   0/1     Pending             0          0s
      spark-shell-a3962a736bf9e775-exec-40   0/1     ContainerCreating   0          0s
      spark-shell-a3962a736bf9e775-exec-38   0/1     Terminating         0          5s
      spark-shell-a3962a736bf9e775-exec-38   0/1     Terminating         0          5s
      spark-shell-a3962a736bf9e775-exec-39   0/1     Error               0          3s
      spark-shell-a3962a736bf9e775-exec-40   1/1     Running             0          1s
      spark-shell-a3962a736bf9e775-exec-41   0/1     Pending             0          0s
      spark-shell-a3962a736bf9e775-exec-41   0/1     Pending             0          0s
      spark-shell-a3962a736bf9e775-exec-41   0/1     ContainerCreating   0          0s
      spark-shell-a3962a736bf9e775-exec-39   0/1     Terminating         0          4s
      spark-shell-a3962a736bf9e775-exec-39   0/1     Terminating         0          4s
      spark-shell-a3962a736bf9e775-exec-41   1/1     Running             0          2s
      spark-shell-a3962a736bf9e775-exec-40   0/1     Error               0          4s
      spark-shell-a3962a736bf9e775-exec-42   0/1     Pending             0          0s
      spark-shell-a3962a736bf9e775-exec-42   0/1     Pending             0          0s
      spark-shell-a3962a736bf9e775-exec-42   0/1     ContainerCreating   0          0s
      spark-shell-a3962a736bf9e775-exec-40   0/1     Terminating         0          4s
      spark-shell-a3962a736bf9e775-exec-40   0/1     Terminating         0          4s

      A cascade of pods being created and terminated within 3-4 seconds results, and it is difficult to see the logs of these constantly created and terminated pods. Thankfully, there is an option

      spark.kubernetes.executor.deleteOnTermination false

      to turn off the auto-deletion of executor pods, which gives us an opportunity to diagnose the problem. However, this is not turned on by default, so one may need to guess what caused the problem in the previous run, reconstruct the steps to reproduce it, and then re-run the application with the exact same setup.
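      As a sketch, the option can be passed when submitting the application; the master URL, container image, and application jar below are placeholders, not values from this issue:

      ```shell
      # Keep failed executor pods around for post-mortem inspection
      # instead of letting Spark delete them on termination.
      spark-submit \
        --master k8s://https://<k8s-apiserver>:6443 \
        --deploy-mode cluster \
        --conf spark.kubernetes.container.image=<spark-image> \
        --conf spark.kubernetes.executor.deleteOnTermination=false \
        local:///opt/spark/examples/jars/spark-examples.jar
      ```

      With this set, the failed pods remain visible to `kubectl describe pod` and `kubectl logs` after they exit.
      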

      So it would be good if we could somehow detect this situation of pods failing as soon as they start (or failing on a particular task), capture the error that caused the pod to terminate, relay it back to the driver, and log it.

      Alternatively, if we could auto-detect this situation, we could also stop creating more executor pods and fail with an appropriate error, retaining the last failed pod for the user's further investigation.
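      One possible shape for such detection, sketched here in standalone Python rather than Spark's actual Scala scheduler code (the class name, threshold, and window are made-up parameters, not an existing Spark API):

      ```python
      import time
      from collections import deque


      class ExecutorFailureTracker:
          """Hypothetical sketch: flag a persistent executor crash loop.

          If `max_failures` executor pods fail within a sliding window of
          `window_seconds`, the application is considered persistently
          failing and the scheduler could stop requesting replacements
          and fail with the captured cause instead.
          """

          def __init__(self, max_failures=5, window_seconds=60.0, clock=time.monotonic):
              self.max_failures = max_failures
              self.window_seconds = window_seconds
              self.clock = clock
              self.failures = deque()  # timestamps of recent pod failures

          def record_failure(self):
              """Record one failed pod; return True once the loop looks persistent."""
              now = self.clock()
              self.failures.append(now)
              # Drop failures that fell outside the sliding window.
              while self.failures and now - self.failures[0] > self.window_seconds:
                  self.failures.popleft()
              return len(self.failures) >= self.max_failures
      ```

      A real implementation would sit where the Kubernetes scheduler backend observes pod status updates, but the windowed-count idea is the same: slow, occasional failures are tolerated, while a rapid cascade like the one above trips the threshold within seconds.
      
      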

      So far it has not been evaluated how this can be achieved, but the feature might be useful as Kubernetes grows into a preferred choice for deploying Spark. Logging this issue for further investigation and work.

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Prashant Sharma
            Votes: 0
            Watchers: 2