Spark / SPARK-29640

[K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Bug
    • Affects Version/s: 2.4.4
    • Fix Version/s: None
    • Component/s: Kubernetes, Spark Core
    • Labels:
      None

      Description

      We are seeing intermittent DNS failures in which the Spark driver cannot resolve "kubernetes.default.svc" while trying to create executors. We are running Spark 2.4.4 (with the patch for SPARK-28921) in cluster mode on EKS.

      This happens approximately 10% of the time.

      Here is the stack trace:

      Exception in thread "main" org.apache.spark.SparkException: External scheduler cannot be instantiated
      	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794)
      	at org.apache.spark.SparkContext.<init>(SparkContext.scala:493)
      	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
      	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935)
      	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926)
      	at scala.Option.getOrElse(Option.scala:121)
      	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
      	at com.rms.execution.test.SparkPiTask$.main(SparkPiTask.scala:36)
      	at com.rms.execution.test.SparkPiTask.main(SparkPiTask.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
      	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
      	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
      	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
      	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
      	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]  for kind: [Pod]  with name: [wf-50000-69674f15d0fc45-1571354060179-driver]  in namespace: [tenant-8-workflows]  failed.
      	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
      	at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
      	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:229)
      	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55)
      	at scala.Option.map(Option.scala:146)
      	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.<init>(ExecutorPodsAllocator.scala:55)
      	at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89)
      	at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788)
      	... 20 more
      Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again
      	at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
      	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
      	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
      	at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
      	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
      	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
      	at okhttp3.Dns$1.lookup(Dns.java:39)
      	at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171)
      	at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:137)
      	at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:82)
      	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:171)
      	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
      	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
      	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
      	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
      	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
      	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
      	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
      	at io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:110)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
      	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
      	at okhttp3.RealCall.execute(RealCall.java:69)
      	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:404)
      	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:365)
      	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:330)
      	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:311)
      	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:810)
      	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:218)
      	... 27 more  

      This issue seems to be caused by https://github.com/kubernetes/kubernetes/issues/76790
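To put a number on the roughly 10% failure rate mentioned above, a small probe can be run from inside a pod in the same cluster. This is only a sketch: `measure_failure_rate` and its parameters are hypothetical names, and it uses Python's `socket.getaddrinfo` rather than the JVM lookup path shown in the stack trace, so it measures the container's resolver behaviour, not Spark's. The "Try again" suffix in the exception appears to correspond to a temporary resolver failure (`EAI_AGAIN`), which surfaces here as `socket.gaierror`.

```python
import socket
from typing import Callable


def measure_failure_rate(
    host: str,
    attempts: int = 100,
    resolve: Callable[[str, int], object] = socket.getaddrinfo,
) -> float:
    """Resolve `host` repeatedly and return the fraction of lookups that failed."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = 0
    for _ in range(attempts):
        try:
            # Port 443 is an assumption (the API server's HTTPS port);
            # only the name lookup matters for this probe.
            resolve(host, 443)
        except socket.gaierror:
            # Any resolver error, including EAI_AGAIN, counts as a failure.
            failures += 1
    return failures / attempts
```

Calling `measure_failure_rate("kubernetes.default.svc")` from a pod returns a value between 0.0 and 1.0. The `resolve` parameter exists so the function can be exercised with a stub instead of a real DNS server.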

      One suggested workaround is to specify TCP mode for DNS lookups in the pod spec (https://github.com/kubernetes/kubernetes/issues/56903#issuecomment-424498508).
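As a rough illustration of that workaround, the sketch below adds a resolver option to a pod manifest through the Kubernetes `dnsConfig.options` field. It assumes the TCP mode in question maps to the glibc `use-vc` resolver option; the helper `with_tcp_dns`, the pod name, and the image are hypothetical. How such a manifest would reach the driver pod is left open here, since that is exactly what this issue asks spark-submit to support. Resolver option handling also depends on the C library in the container image: glibc honors `use-vc`, while other implementations may ignore it.

```python
import copy
import json


def with_tcp_dns(pod: dict) -> dict:
    """Return a copy of a pod manifest whose dnsConfig asks the resolver to use TCP."""
    patched = copy.deepcopy(pod)  # leave the caller's manifest untouched
    dns_config = patched.setdefault("spec", {}).setdefault("dnsConfig", {})
    options = dns_config.setdefault("options", [])
    # Add the option only once so the function is safe to apply repeatedly.
    if not any(opt.get("name") == "use-vc" for opt in options):
        options.append({"name": "use-vc"})
    return patched


# Hypothetical driver pod manifest, for illustration only.
driver_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-driver"},
    "spec": {
        "containers": [
            {"name": "spark-kubernetes-driver", "image": "example/spark:2.4.4"}
        ]
    },
}

# Print just the DNS section that would be added to the pod spec.
print(json.dumps(with_tcp_dns(driver_pod)["spec"]["dnsConfig"], indent=2))
```

Keeping the change as a pure function over the manifest makes it easy to check that the option is added exactly once and that existing DNS options are preserved.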

      I would like spark-submit to accept a flag that enables TCP mode for DNS lookups.

      I am working on a PR for this.

            People

            • Assignee: Unassigned
            • Reporter: Andy Grove (andygrove)
