Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1
- Fix Version/s: None
- Labels: None
- Environment: YARN
Description
Spark waits for the spark.locality.wait period before falling back from RACK_LOCAL to ANY when selecting an executor for a task. However, the timeout is effectively reset each time a new task is launched.
We were running Spark Streaming on Kafka with a 10-second batch window, 32 Kafka partitions, and 16 executors. All executors were initially in the ANY group. At one point a RACK_LOCAL executor was added and all tasks were assigned to it. Each task took about 0.6 seconds to process, which repeatedly reset the spark.locality.wait timeout (3000 ms). As a result, the job under-utilized its resources and built up an ever-growing backlog.
spark.locality.wait should be measured from the task set's creation time, not from the last task launch time, so that 3000 ms after the task set is created, tasks can be assigned to all executors.
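A minimal sketch (not Spark's actual scheduler code; all names here are hypothetical) of why resetting the locality timer on every launch starves the less-local executors: with 0.6 s tasks and a 3 s wait, the time elapsed since the last launch never reaches the threshold, whereas a timer anchored to the task set's creation time expires after 3 s as expected.

```python
def can_schedule_non_local(now, creation, last_launch, wait, reset_on_launch):
    """Return True once the locality wait has expired.

    reset_on_launch=True models the reported behavior (timer anchored to
    the last task launch); False models the proposed fix (anchored to the
    task set's creation time).
    """
    anchor = last_launch if reset_on_launch else creation
    return now - anchor >= wait

WAIT = 3.0        # spark.locality.wait, in seconds (3000 ms)
TASK_TIME = 0.6   # per-task processing time on the RACK_LOCAL executor
creation = 0.0

# Current behavior: 100 back-to-back launches on the RACK_LOCAL executor.
last_launch = 0.0
escaped = False
for i in range(100):
    now = creation + i * TASK_TIME
    if can_schedule_non_local(now, creation, last_launch, WAIT, True):
        escaped = True
    last_launch = now  # each launch resets the timer; the gap stays 0.6 s
print(escaped)  # False: the wait never expires, tasks never fall to ANY

# Proposed behavior: anchored to creation time, the wait expires at 3 s.
print(can_schedule_non_local(3.0, creation, last_launch, WAIT, False))  # True
```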
As a workaround, we are specifying a zero timeout for now to disable the locality optimization.
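The workaround can be applied as a configuration fragment, for example in spark-defaults.conf (or via --conf on spark-submit); spark.locality.wait=0 makes the scheduler skip delay scheduling and hand tasks to any available executor immediately:

```
# Workaround: disable delay scheduling so tasks fall through to ANY
# immediately instead of waiting for a RACK_LOCAL slot.
spark.locality.wait    0
```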
Issue Links
- duplicates: SPARK-18886 "Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout" (Resolved)