Spark / SPARK-23980

Resilient Spark driver on Kubernetes


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      The current implementation of the `Spark driver` on Kubernetes is not resilient to node failures, because the driver is deployed as a bare `Pod`. If the node running the driver fails, Kubernetes terminates the pods that were running on that node and does not reschedule them onto any of the other nodes in the cluster.

      If the `driver` is instead created as a Kubernetes `Job`, its pod is rescheduled onto another node after such a failure (a sketch of what this could look like follows).
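
      As a rough sketch of the driver-as-`Job` part (not the exact change in the linked PR), using the fabric8 kubernetes-client builders that the Kubernetes scheduler backend is built on; the import path, names, image and backoff limit are illustrative assumptions:

      ```scala
      // Assumption: fabric8 kubernetes-client is on the classpath; in newer client
      // versions the Job model lives under `batch.v1` instead of `batch`.
      import io.fabric8.kubernetes.api.model.batch.JobBuilder

      // Wrap the driver pod template in a Job so the Job controller re-creates the
      // driver pod on another node when the original node fails.
      val driverJob = new JobBuilder()
        .withNewMetadata()
          .withName("spark-pi-driver")          // hypothetical application name
        .endMetadata()
        .withNewSpec()
          .withBackoffLimit(3)                  // hypothetical retry budget for the driver
          .withNewTemplate()
            .withNewSpec()
              .withRestartPolicy("OnFailure")   // a Job's pods must use OnFailure or Never
              .addNewContainer()
                .withName("spark-kubernetes-driver")
                .withImage("spark:2.3.0")       // hypothetical driver image
              .endContainer()
            .endSpec()
          .endTemplate()
        .endSpec()
        .build()
      ```

      The resulting `Job` would then be submitted through the same Kubernetes client that today creates the bare driver `Pod`.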

      When the driver is terminated, its executors (which may be running on other nodes) are only terminated by Kubernetes garbage collection after some delay.

      This can lead to concurrency issues where the re-spawned `driver` tries to create new executors with the same names as executors that are still in the middle of being cleaned up by Kubernetes garbage collection.

      To solve this issue, the executor pod names must be made unique for each `driver` instance, as sketched below.
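
      A minimal sketch of that uniqueness, with made-up object and method names; any scheme that changes on every driver start (for example reusing the new driver pod's UID) would serve equally well:

      ```scala
      import java.util.UUID

      // Hypothetical helper: fold a per-driver-instance suffix into every executor
      // pod name so a re-spawned driver never collides with executors of its previous
      // incarnation that are still being garbage collected.
      object ExecutorPodNames {
        // Generated once per driver JVM start.
        private val driverInstanceId: String = UUID.randomUUID().toString.take(8)

        def executorPodName(appId: String, executorId: Long): String =
          s"$appId-$driverInstanceId-exec-$executorId"
      }

      // e.g. ExecutorPodNames.executorPodName("spark-pi-1523", 1)
      //      might return "spark-pi-1523-4f2b0c1d-exec-1"
      ```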

      The PR linked to this Jira implements the above: it creates the Spark driver as a `Job` and ensures that executor pod names are unique per driver instance.

      Attachments

        Issue Links

        Activity


          People

            Assignee: Unassigned
            Reporter: Sebastian Toader (stoader)
            Votes: 2
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved:
