Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Versions: 3.2.1, 3.3.3, 3.4.1, 3.5.0
Description
test("SPARK-XXX") {
  import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}
  withTempDir { dir =>
    val scriptPath = createTempScriptWithExpectedOutput(dir, "gpuDiscoveryScript",
      """{"name": "gpu","addresses":["0"]}""")
    val conf = new SparkConf()
      .setAppName("test")
      .setMaster("local-cluster[1, 12, 1024]")
      .set("spark.executor.cores", "12")
    conf.set(TASK_GPU_ID.amountConf, "0.08")
    conf.set(WORKER_GPU_ID.amountConf, "1")
    conf.set(WORKER_GPU_ID.discoveryScriptConf, scriptPath)
    conf.set(EXECUTOR_GPU_ID.amountConf, "1")
    sc = new SparkContext(conf)
    val rdd = sc.range(0, 100, 1, 4)
    var rdd1 = rdd.repartition(3)
    val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)
    val rp = new ResourceProfileBuilder().require(treqs).build
    rdd1 = rdd1.withResources(rp)
    assert(rdd1.collect().size === 100)
  }
}
In the test above, the 3 tasks generated by rdd1 are expected to execute sequentially, because the task-level request "new TaskResourceRequests().cpus(1).resource("gpu", 1.0)" should override "conf.set(TASK_GPU_ID.amountConf, "0.08")". In fact, however, the 3 tasks run in parallel.
The root cause is that ExecutorData#ExecutorResourceInfo#numParts is static. In this case, "gpu.numParts" is initialized to 12 (1 / 0.08) and never changes, even when a new task resource request arrives (e.g., resource("gpu", 1.0) here). As a result, the 3 tasks can still be executed in parallel.
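The effect of the static slot count can be sketched with a hypothetical, simplified model (the class and member names below only mirror ExecutorResourceInfo for illustration; they are not the real Spark implementation). The one GPU address is split into 12 slots at startup (1 / 0.08 = 12), and a later profile requesting gpu = 1.0 is still scheduled against those 12 slots:

```scala
// Simplified stand-in for ExecutorResourceInfo (illustrative only).
// The number of slots per address is fixed at construction and never
// recomputed, which is the behavior described above.
class SimpleResourceInfo(addresses: Seq[String], slotsPerAddress: Int) {
  private var availableSlots: Int = addresses.size * slotsPerAddress
  def acquire(): Boolean =
    if (availableSlots > 0) { availableSlots -= 1; true } else false
}

object StaticNumPartsDemo {
  def main(args: Array[String]): Unit = {
    // One GPU address, 12 slots derived from the default task amount 0.08.
    val gpu = new SimpleResourceInfo(Seq("0"), slotsPerAddress = 12)
    // A ResourceProfile requesting gpu = 1.0 should allow only 1 concurrent
    // task, but since the slot count was frozen at 12, all 3 tasks acquire
    // a slot and run in parallel.
    val launched = (1 to 3).count(_ => gpu.acquire())
    println(launched)
  }
}
```

With a correct fix, the slot count would be recomputed from the active task resource request, so only one of the three acquisitions would succeed.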
Attachments
Issue Links
- is broken by: SPARK-39853 Support stage level schedule for standalone cluster when dynamic allocation is disabled (Resolved)
- is related to: SPARK-47458 Incorrect to calculate the concurrent task number (Resolved)