SPARK-35334: Spark should be more resilient to intermittent K8s flakiness


    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.3.0
    • Component/s: Kubernetes
    • Labels: None

      Description

      Internal K8s errors, such as an etcdserver leader election, are propagated to the API client and can cause serious issues in Spark, for example:

      Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at:
      https://kubernetes.default.svc/api/v1/namespaces/dex-app-bl24w4z9/pods/sparkpi-10-fcd3f6781a874212-driver. Message: etcdserver: 
      leader changed. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=etcdserver: leader changed, 
      metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null, 
      status=Failure, additionalProperties={}).
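
      For illustration, a minimal Scala sketch of the failing call and of how such a response could be classified as transient; the pod lookup is the standard fabric8 API, while the classification is just an assumption of mine (the namespace and pod name are copied from the log above):

      import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClientException}

      val client = new DefaultKubernetesClient()
      try {
        // Routine driver-pod lookup, as in the log above; an etcd hiccup
        // surfaces here as a KubernetesClientException with HTTP code 500.
        client.pods()
          .inNamespace("dex-app-bl24w4z9")
          .withName("sparkpi-10-fcd3f6781a874212-driver")
          .get()
      } catch {
        case e: KubernetesClientException if e.getCode >= 500 && e.getCode < 600 =>
          // Server-side (5xx) errors like "etcdserver: leader changed" are
          // transient, yet today they propagate all the way up into Spark.
          throw e
      }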
      

      As a first step, I will try to fix this in kubernetes-client itself by adding retries with exponential backoff:
      https://github.com/fabric8io/kubernetes-client/issues/3087
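
      A minimal sketch of the retry-with-exponential-backoff behavior proposed there (the names, defaults, and retry condition are illustrative, not the eventual library API):

      import io.fabric8.kubernetes.client.KubernetesClientException

      // Retries `op` on transient (HTTP 5xx) API server errors, doubling the
      // wait between attempts; gives up after `maxAttempts` tries.
      def withBackoffRetries[T](maxAttempts: Int = 3, initialBackoffMs: Long = 1000L)(op: => T): T = {
        var attempt = 1
        var backoffMs = initialBackoffMs
        var result: Option[T] = None
        while (result.isEmpty) {
          try {
            result = Some(op)
          } catch {
            case e: KubernetesClientException
                if e.getCode >= 500 && e.getCode < 600 && attempt < maxAttempts =>
              Thread.sleep(backoffMs)
              backoffMs *= 2   // exponential growth between attempts
              attempt += 1
          }
        }
        result.get
      }

      A caller would then wrap each API request, e.g. withBackoffRetries() { client.pods().inNamespace(ns).withName(name).get() }.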

      If I manage that, the Spark-side change could be reduced to a version update plus the introduction of some new configs.
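
      To sketch what that Spark side could look like once the client supports retries (the spark.kubernetes.* key names and the two retry-related builder methods are assumptions that depend on what the kubernetes-client fix finally exposes):

      import io.fabric8.kubernetes.client.{Config, ConfigBuilder}
      import org.apache.spark.SparkConf

      // Hypothetical wiring: read the new Spark configs and pass them through
      // to the fabric8 client configuration.
      def buildClientConfig(conf: SparkConf): Config = {
        val retryLimit = conf.getInt("spark.kubernetes.client.retry.backoffLimit", 3)
        val retryIntervalMs = conf.getInt("spark.kubernetes.client.retry.backoffIntervalMs", 1000)
        new ConfigBuilder()
          .withRequestRetryBackoffLimit(retryLimit)          // assumed new builder method
          .withRequestRetryBackoffInterval(retryIntervalMs)  // assumed new builder method
          .build()
      }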


            People

            • Assignee: Attila Zsolt Piros (attilapiros)
            • Reporter: Attila Zsolt Piros (attilapiros)
            • Votes: 0
            • Watchers: 4
