Spark / SPARK-29640

[K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 2.4.4
    • Fix Version/s: None
    • Component/s: Kubernetes, Spark Core
    • Labels: None

    Description

      We are running into intermittent DNS failures in which the Spark driver cannot resolve "kubernetes.default.svc" when trying to create executors. We are running Spark 2.4.4 (with the patch for SPARK-28921) in cluster mode on EKS.

      This happens approximately 10% of the time.

      Here is the stack trace:

      Exception in thread "main" org.apache.spark.SparkException: External scheduler cannot be instantiated
      	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
      	at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
      	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
      	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
      	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
      	at scala.Option.getOrElse(Option.scala:121)
      	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
      	at com.rms.execution.test.SparkPiTask$.main(SparkPiTask.scala:36)
      	at com.rms.execution.test.SparkPiTask.main(SparkPiTask.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
      	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
      	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
      	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
      	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
      	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [wf-50000-69674f15d0fc45-1571354060179-driver]  in namespace: [tenant-8-workflows]  failed.
      	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
      	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
      	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:229)
      	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
      	at scala.Option.map(Option.scala:146)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
      	at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
      	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
      	... 20 more
      Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
      	at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
      	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
      	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
      	at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
      	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
      	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
      	at okhttp3.Dns$1.lookup(Dns.java:39)
      	at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171)
      	at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:137)
      	at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:82)
      	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:171)
      	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
      	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
      	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
      	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
      	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
      	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
      	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
      	at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
      	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
      	at okhttp3.RealCall.execute(RealCall.java:69)
      	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:404)
      	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
      	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:330)
      	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:311)
      	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:810)
      	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:218)
      	... 27 more  

      This issue seems to be caused by https://github.com/kubernetes/kubernetes/issues/76790

      One suggested workaround is to specify TCP mode for DNS lookups in the pod spec (https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-424498508).
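
      As a rough illustration of that workaround, the sketch below adds a TCP-only resolver option to a pod spec's dnsConfig. It assumes the option in question is glibc's "use-vc" (which forces DNS queries over TCP and may not be honored by other libc implementations); the linked comment should be checked for the exact setting. The helper name with_tcp_dns and the example driver spec are hypothetical, and how such a fragment would reach the Spark driver pod is outside this sketch.

```python
import json


def with_tcp_dns(pod_spec: dict) -> dict:
    """Return a copy of pod_spec whose dnsConfig asks the resolver to use TCP.

    Hypothetical helper for illustration; it is not part of Spark or Kubernetes.
    """
    spec = dict(pod_spec)
    dns_config = dict(spec.get("dnsConfig", {}))
    # Copy any existing resolver options so the input is not mutated.
    options = [dict(o) for o in dns_config.get("options", [])]
    # Assumption: "use-vc" is the glibc resolv.conf option that forces TCP.
    if not any(o.get("name") == "use-vc" for o in options):
        options.append({"name": "use-vc"})
    dns_config["options"] = options
    spec["dnsConfig"] = dns_config
    return spec


if __name__ == "__main__":
    # Placeholder driver pod spec; the container name and image are made up.
    driver_spec = {
        "containers": [{"name": "spark-driver", "image": "example/spark:2.4.4"}]
    }
    print(json.dumps(with_tcp_dns(driver_spec), indent=2))
```

      Kubernetes writes dnsConfig options into the pod's /etc/resolv.conf, so the resulting fragment would show up there as an "options use-vc" entry, merged with any options the cluster already sets.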

      I would like spark-submit to accept a flag that makes DNS lookups use TCP mode.

      I am working on a PR for this.

          People

            Assignee: Unassigned
            Reporter: Andy Grove (andygrove)
            Votes: 0
            Watchers: 5
