Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-28467

Flaky test: org.apache.spark.SparkContextSuite.test driver discovery under local-cluster mode

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Minor
    • Resolution: Duplicate
    • 3.0.0
    • None
    • Tests
    • None

    Description

      We ran unit tests on arm64 instance, and there are tests failed due to the executor can't up under the timeout 10000 ms:

      • test driver discovery under local-cluster mode *** FAILED ***
        java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
        at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293)
        at org.apache.spark.SparkContextSuite.$anonfun$new$78(SparkContextSuite.scala:753)
        at org.apache.spark.SparkContextSuite.$anonfun$new$78$adapted(SparkContextSuite.scala:741)
        at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
        at org.apache.spark.SparkContextSuite.$anonfun$new$77(SparkContextSuite.scala:741)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
        at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
        at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
        at org.scalatest.Transformer.apply(Transformer.scala:22)
      • test gpu driver resource files and discovery under local-cluster mode *** FAILED ***
        java.util.concurrent.TimeoutException: Can't find 1 executors before 10000 milliseconds elapsed
        at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:293)
        at org.apache.spark.SparkContextSuite.$anonfun$new$80(SparkContextSuite.scala:781)
        at org.apache.spark.SparkContextSuite.$anonfun$new$80$adapted(SparkContextSuite.scala:761)
        at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:161)
        at org.apache.spark.SparkContextSuite.$anonfun$new$79(SparkContextSuite.scala:761)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
        at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
        at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
        at org.scalatest.Transformer.apply(Transformer.scala:22)

      And then we increase the timeout to 20000(or 30000) the tests passed, I found there are other issues about the timeout increasing before, see: https://issues.apache.org/jira/browse/SPARK-7989 and https://issues.apache.org/jira/browse/SPARK-10651
      I think the timeout doesn't work well, and seems there is no principle of the timeout setting, how can I fix this? Could I increase the timeout for these two tests?

       


      This test fails intermittently in Amplab CI as well.

      https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/110681/testReport/

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            huangtianhua huangtianhua
            Sean R. Owen Sean R. Owen
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment