[SPARK-47458] Incorrect to calculate the concurrent task number - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0.0
Fix Version/s: 4.0.0
Component/s: Spark Core
Labels:
- pull-request-available

Description

The following test case fails:

test("problem of calculating the maximum concurrent task") {
  withTempDir { dir =>
    val discoveryScript = createTempScriptWithExpectedOutput(
      dir, "gpuDiscoveryScript", """{"name": "gpu","addresses":["0", "1", "2", "3"]}""")

    val conf = new SparkConf()
      // Set up a local cluster with a single worker that has 6 CPU cores,
      // 1024 MB of memory, and 4 GPUs.
      .setMaster("local-cluster[1, 6, 1024]")
      .setAppName("test-cluster")
      .set(WORKER_GPU_ID.amountConf, "4")
      .set(WORKER_GPU_ID.discoveryScriptConf, discoveryScript)
      .set(EXECUTOR_GPU_ID.amountConf, "4")
      .set(TASK_GPU_ID.amountConf, "2")
      // disable barrier stage retry to fail the application as soon as possible
      .set(BARRIER_MAX_CONCURRENT_TASKS_CHECK_MAX_FAILURES, 1)
    sc = new SparkContext(conf)
    TestUtils.waitUntilExecutorsUp(sc, 1, 60000)

    // Set up a barrier stage with 2 tasks, each requiring 1 CPU (the default) and 2 GPUs.
    // The total requirement (2 CPUs and 4 GPUs) fits within the cluster's 6 CPUs and 4 GPUs,
    // so the stage should be able to run.
    assert(sc.parallelize(Range(1, 10), 2)
      .barrier()
      .mapPartitions { iter => iter }
      .collect() sameElements Range(1, 10).toArray[Int])
  }
}

The error log

[SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
org.apache.spark.scheduler.BarrierJobSlotsNumberCheckFailed: [SPARK-24819]: Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently. Please init a new cluster with more resources(e.g. CPU, GPU) or repartition the input RDD(s) to reduce the number of slots required to run this barrier stage.
at org.apache.spark.errors.SparkCoreErrors$.numPartitionsGreaterThanMaxNumConcurrentTasksError(SparkCoreErrors.scala:241)
at org.apache.spark.scheduler.DAGScheduler.checkBarrierStageWithNumSlots(DAGScheduler.scala:576)
at org.apache.spark.scheduler.DAGScheduler.createResultStage(DAGScheduler.scala:654)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1321)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3055)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3046)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3035)
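
For reference, the slot count one would expect from the configuration in the test can be worked out by hand. The sketch below is only an illustration of that arithmetic, not Spark's scheduler code: `SlotEstimate` and `maxConcurrentTasks` are hypothetical names, and it assumes the single executor receives all 6 worker cores and that each task uses the default of 1 CPU.

```scala
// Hypothetical helper, not part of Spark: estimates how many tasks one
// executor can run at once, limited by its scarcest resource.
object SlotEstimate {
  def maxConcurrentTasks(
      executorCores: Int,
      cpusPerTask: Int,
      executorResources: Map[String, Int],
      taskResources: Map[String, Int]): Int = {
    // Slots allowed by CPU cores alone.
    val cpuSlots = executorCores / cpusPerTask
    // Slots allowed by each custom resource (e.g. "gpu") that a task requests.
    val resourceSlots = taskResources.toSeq.collect {
      case (name, perTask) if perTask > 0 =>
        executorResources.getOrElse(name, 0) / perTask
    }
    // The most constrained resource decides the result.
    (cpuSlots +: resourceSlots).min
  }

  def main(args: Array[String]): Unit = {
    // Values taken from the test: 6 cores, 4 GPUs per executor, 2 GPUs per task.
    // Assumption: 1 CPU per task (the default).
    val slots = maxConcurrentTasks(
      executorCores = 6,
      cpusPerTask = 1,
      executorResources = Map("gpu" -> 4),
      taskResources = Map("gpu" -> 2))
    println(s"slots = $slots") // prints "slots = 2"
  }
}
```

Under these assumptions the GPU is the limiting resource (4 / 2 = 2 slots versus 6 CPU slots), so a barrier stage with 2 tasks should fit, which is why the `BarrierJobSlotsNumberCheckFailed` error above is unexpected.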

Issue Links

relates to

SPARK-45527 Task fraction resource request is not expected

Resolved

links to

GitHub Pull Request #45528

People

Assignee: Bobby Wang

Reporter: Bobby Wang

Votes: 0

Watchers: 1

Dates

Created: 19/Mar/24 06:41

Updated: 01/Apr/24 02:46

Resolved: 19/Mar/24 15:08