Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Affects Version/s: 3.4.0
- Labels: None
Description
test("SPARK-XXX") { val conf = new SparkConf().setAppName("test").setMaster("local-cluster[1,4,1024]") sc = new SparkContext(conf) val req = new TaskResourceRequests().cpus(3) val rp = new ResourceProfileBuilder().require(req).build() val res = sc.parallelize(Seq(0, 1), 2).withResources(rp).map { x => Thread.sleep(5000) x * 2 }.collect() assert(res === Array(0, 2)) }
In this test, the two tasks should be scheduled one after the other, since each task requires 3 CPU cores and the executor has only 4. However, the logs show that both tasks are launched concurrently.
It turns out that the taskCpus used for task scheduling comes from the task set's TaskResourceProfile (taskCpus = 3):
val rpId = taskSet.taskSet.resourceProfileId
val taskSetProf = sc.resourceProfileManager.resourceProfileFromId(rpId)
val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(taskSetProf, conf)
while the taskCpus subtracted from the executor's free cores in ExecutorData comes from the executor's ResourceProfile (taskCpus = 1):
val rpId = executorData.resourceProfileId
val prof = scheduler.sc.resourceProfileManager.resourceProfileFromId(rpId)
val taskCpus = ResourceProfile.getTaskCpusOrDefaultForProfile(prof, conf)
executorData.freeCores -= taskCpus
This makes the accounting of available cores inconsistent: the scheduler checks whether 3 cores are free before launching a task, but each launch only subtracts 1 from freeCores, so after the first task the 4-core executor still appears to have 3 free cores and the second task is launched as well.
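The mismatch can be reproduced with a minimal, self-contained sketch (plain Scala, not Spark code; the names below are hypothetical, only the core counts mirror the report): the offer check admits a task whenever at least 3 cores appear free, but each launch subtracts only 1 core, so a 4-core executor accepts both tasks.

object OverScheduleSketch {
  def main(args: Array[String]): Unit = {
    val executorTotalCores = 4   // local-cluster[1,4,1024]: one executor with 4 cores
    var freeCores = executorTotalCores

    // taskCpus seen by the scheduling check (task set's TaskResourceProfile, cpus(3))
    val schedulingTaskCpus = 3
    // taskCpus subtracted from the executor's free cores (executor's default ResourceProfile)
    val bookkeepingTaskCpus = 1

    var launched = 0
    var pendingTasks = 2         // the test job has 2 partitions
    while (pendingTasks > 0 && freeCores >= schedulingTaskCpus) {
      freeCores -= bookkeepingTaskCpus   // only 1 core is deducted per launch
      launched += 1
      pendingTasks -= 1
      println(s"launched task $launched, freeCores now $freeCores")
    }
    println(s"tasks launched concurrently: $launched")   // prints 2, not 1
  }
}

If both places derived taskCpus from the same ResourceProfile, the sketch would launch only one task at a time, which is the behaviour the test expects.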
Attachments
Issue Links
- is caused by: SPARK-39853 Support stage level schedule for standalone cluster when dynamic allocation is disabled (Resolved)
- links to