Spark / SPARK-40954

Kubernetes integration tests stuck forever on Mac M1 with Minikube + Docker

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3.1
    • Fix Version/s: None
    • Component/s: Kubernetes, Tests
    • Labels: None
    • Environment: macOS 12.6 (Mac M1)

      Minikube 1.27.1

      Docker 20.10.17

    Description

      I tried running Kubernetes integration tests with the Minikube backend (+ Docker driver) from commit c26d99e3f104f6603e0849d82eca03e28f196551 on Spark's master branch. I ran them with the following command:

       

      mvn integration-test -am -pl :spark-kubernetes-integration-tests_2.12 \
                              -Pkubernetes -Pkubernetes-integration-tests \
                              -Phadoop-3 \
                              -Dspark.kubernetes.test.imageTag=MY_IMAGE_TAG_HERE \
                              -Dspark.kubernetes.test.imageRepo=docker.io/kubespark \
                              -Dspark.kubernetes.test.namespace=spark \
                              -Dspark.kubernetes.test.serviceAccountName=spark \
                              -Dspark.kubernetes.test.deployMode=minikube  

      However, the test suite hung for hours on my machine.

       

      Investigation

      I ran jstack on the process that was running the tests and saw that it was stuck here:

       

      "ScalaTest-main-running-KubernetesSuite" #1 prio=5 os_prio=31 tid=0x00007f78d580b800 nid=0x2503 runnable [0x0000000304749000]
         java.lang.Thread.State: RUNNABLE
          at java.io.FileInputStream.readBytes(Native Method)
          at java.io.FileInputStream.read(FileInputStream.java:255)
          at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
          at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
          - locked <0x000000076c0b6f40> (a java.lang.UNIXProcess$ProcessPipeInputStream)
          at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
          at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
          at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
          - locked <0x000000076c0bb410> (a java.io.InputStreamReader)
          at java.io.InputStreamReader.read(InputStreamReader.java:184)
          at java.io.BufferedReader.fill(BufferedReader.java:161)
          at java.io.BufferedReader.readLine(BufferedReader.java:324)
          - locked <0x000000076c0bb410> (a java.io.InputStreamReader)
          at java.io.BufferedReader.readLine(BufferedReader.java:389)
          at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:74)
          at scala.collection.Iterator.foreach(Iterator.scala:943)
          at scala.collection.Iterator.foreach$(Iterator.scala:943)
          at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
          at org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2(ProcessUtils.scala:45)
          at org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.$anonfun$executeProcess$2$adapted(ProcessUtils.scala:45)
          at org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$$$Lambda$322/20156341.apply(Unknown Source)
          at org.apache.spark.deploy.k8s.integrationtest.Utils$.tryWithResource(Utils.scala:49)
          at org.apache.spark.deploy.k8s.integrationtest.ProcessUtils$.executeProcess(ProcessUtils.scala:45)
          at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.executeMinikube(Minikube.scala:103)
          at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.minikubeServiceAction(Minikube.scala:112)
          at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$getServiceUrl$1(DepsTestsSuite.scala:281)
          at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite$$Lambda$611/1461360262.apply(Unknown Source)
          at org.scalatest.enablers.Retrying$$anon$4.makeAValiantAttempt$1(Retrying.scala:184)
          at org.scalatest.enablers.Retrying$$anon$4.tryTryAgain$2(Retrying.scala:196)
          at org.scalatest.enablers.Retrying$$anon$4.retry(Retrying.scala:226)
          at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:313)
          at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:312)
          at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:457)
          at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.getServiceUrl(DepsTestsSuite.scala:278)
          at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.tryDepsTest(DepsTestsSuite.scala:325)
          at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite.$anonfun$$init$$1(DepsTestsSuite.scala:160)
          at org.apache.spark.deploy.k8s.integrationtest.DepsTestsSuite$$Lambda$178/1750286943.apply$mcV$sp(Unknown Source)
      [...]

      So the issue comes from DepsTestsSuite while it is setting up minio. After creating the minio StatefulSet and Service, it executes the 'minikube service -n spark minio-s3 --url' command. It then gets stuck in ProcessUtils.executeProcess (ProcessUtils.scala:45 in the stack trace above) while reading minikube's stdout.

      I then ran the same command from my shell and confirmed that it never returns until a CTRL+C:

      $ minikube service -n spark minio-s3 --url
      http://127.0.0.1:63114
      ❗  Because you are using a Docker driver on darwin, the terminal needs to be open to run it.
      
      <COMMAND IS STILL RUNNING HERE>

      This appears to be the normal behaviour of the 'minikube service' command on Mac with the Docker driver: it has to keep a tunnel open for as long as the URL is in use. I had a quick look at Minikube's source code and it seems to be happening here: https://github.com/kubernetes/minikube/blob/abed8b7d347ae15fe9c0acd91b5b49b3b6494a53/cmd/minikube/cmd/service.go#L154

      It also seems to be confirmed by the docs: https://minikube.sigs.k8s.io/docs/handbook/accessing/ 

      As a result, the code that reads minikube's stdout until end-of-stream hangs indefinitely: the process never exits, so the stream is never closed. I was able to reproduce this with a self-contained example as well; see the attached TestProcess.scala file (it assumes that there is a minio-s3 Service in the spark Namespace).
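To illustrate the mechanism without needing Minikube, here is a minimal sketch (this is not the attached TestProcess.scala). It assumes a POSIX `sh` is available and uses a stand-in command that prints a URL immediately and then stays alive for 3 seconds. The object name `BlockingReadDemo` and the helper `readAll` are hypothetical. Reading all lines with `getLines()` only completes once the child closes its stdout, so with a command that never exits the read would never return.

```scala
import java.nio.charset.StandardCharsets
import scala.io.Source

// Hypothetical demo object, not part of Spark.
object BlockingReadDemo {
  // Reads every stdout line of `command` until end-of-stream and returns
  // the lines together with the elapsed time in milliseconds.
  def readAll(command: Seq[String]): (List[String], Long) = {
    val start = System.nanoTime()
    val process = new ProcessBuilder(command: _*).redirectErrorStream(true).start()
    val source = Source.fromInputStream(process.getInputStream, StandardCharsets.UTF_8.name())
    try {
      // toList drains the iterator; it returns only when the child closes
      // stdout, which for a tunnel-style command means "when it exits".
      val lines = source.getLines().toList
      (lines, (System.nanoTime() - start) / 1000000L)
    } finally {
      source.close()
      process.destroy()
    }
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for 'minikube service ... --url': the URL is printed at once,
    // but the read is held up until the 3-second sleep finishes.
    val (lines, elapsedMs) = readAll(Seq("sh", "-c", "echo http://127.0.0.1:63114; sleep 3"))
    println(s"lines=$lines elapsedMs=$elapsedMs")
  }
}
```

The URL is available after a few milliseconds, yet the caller only gets it after the child exits. Replacing `sleep 3` with a process that runs until CTRL+C gives the hang described above.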

       

      I am not sure what the best solution would be here. Ideally, we should run the 'minikube service' command and retrieve the URL without blocking, while leaving the command running so that the tunnel stays open. When DepsTestsSuite terminates, we should also make sure to terminate the minikube process.
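One possible shape for this idea, as a rough sketch rather than a proposed patch: start the command, wait a bounded time for the first stdout line, return it while the process keeps running, and destroy the process on teardown. The names `TunnelProcess`, `start`, and `firstLine` are hypothetical and do not exist in Spark. The sketch assumes that the URL is the first line printed on stdout and that the command prints little else afterwards (stdout is not drained after the first line).

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.concurrent.{Callable, Executors, TimeUnit, TimeoutException}

// Hypothetical helper, not part of Spark: owns a long-running child process
// and exposes the first line it printed on stdout.
final class TunnelProcess(val process: Process, val firstLine: String) extends AutoCloseable {
  // Stops the child (and with it the tunnel); escalates if it ignores SIGTERM.
  override def close(): Unit = {
    process.destroy()
    if (!process.waitFor(5, TimeUnit.SECONDS)) process.destroyForcibly()
  }
}

object TunnelProcess {
  // Starts `command`, waits at most `timeoutSeconds` for one stdout line,
  // and returns without waiting for the process to exit.
  def start(command: Seq[String], timeoutSeconds: Long): TunnelProcess = {
    val process = new ProcessBuilder(command: _*)
      .redirectError(ProcessBuilder.Redirect.INHERIT) // keep stderr out of the pipe
      .start()
    val reader = new BufferedReader(
      new InputStreamReader(process.getInputStream, StandardCharsets.UTF_8))
    val executor = Executors.newSingleThreadExecutor()
    try {
      // readLine() blocks, so run it on a worker thread and bound the wait.
      val future = executor.submit(new Callable[String] {
        override def call(): String = reader.readLine()
      })
      val line = try future.get(timeoutSeconds, TimeUnit.SECONDS) catch {
        case e: TimeoutException =>
          process.destroyForcibly() // closing the pipe unblocks the reader thread
          throw e
      }
      if (line == null) {
        process.destroyForcibly()
        throw new IllegalStateException("process exited without printing a line")
      }
      new TunnelProcess(process, line.trim)
    } finally {
      executor.shutdown()
    }
  }
}
```

In the test suite this could be used along the lines of `val tunnel = TunnelProcess.start(Seq("minikube", "service", "-n", "spark", "minio-s3", "--url"), 60)`, reading `tunnel.firstLine` for the URL and calling `tunnel.close()` when the suite finishes. On platforms where the command exits on its own, `close()` is a no-op on an already-terminated process.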

      Attachments

        1. TestProcess.scala
          2 kB
          Anton Ippolitov

          People

            Assignee: Unassigned
            Reporter: Anton Ippolitov (_anton)