Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 2.1.0
- None
Description
Delay scheduling can introduce an unbounded delay and underutilization of cluster resources under the following circumstances:
1. Tasks have locality preferences for a subset of available resources.
2. Tasks finish in less time than the delay-scheduling wait.
Instead of waiting once for resources with better locality, Spark can end up waiting indefinitely, because the wait timer restarts each time a task is launched.
As an example, consider a cluster with 100 executors, and a taskset with 500 tasks. Say all tasks have a preference for one executor, which is by itself on one host. Given the default locality wait of 3s per level, we end up with a 6s delay until we schedule on other hosts (process wait + host wait).
If each task takes 5 seconds (under the 6 second delay), then all 500 tasks get scheduled on only one executor. This means you're using only 1% of your cluster, and you get a ~100x slowdown. You'd actually be better off if tasks took 7 seconds.
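The example above can be reproduced with a small toy simulation. This is only a sketch under simplifying assumptions, not Spark's actual scheduler: every task prefers executor 0, the locality timer restarts each time a task launches on that executor, and other executors may take tasks only once the full wait has elapsed. The function name `simulate` and its parameters are made up for illustration.

```python
def simulate(num_executors, num_tasks, task_secs, total_wait_secs):
    """Toy model (not Spark code): every task prefers executor 0.

    Returns (makespan_in_seconds, tasks_launched_per_executor).
    Assumes task_secs > 0.
    """
    free_at = [0] * num_executors    # time at which each executor becomes idle
    launched = [0] * num_executors   # number of tasks launched on each executor
    pending = num_tasks
    last_local_launch = 0            # assumption: timer restarts on each local launch
    now = 0
    while pending > 0:
        # The preferred executor takes a task whenever it is idle.
        if free_at[0] <= now:
            free_at[0] = now + task_secs
            launched[0] += 1
            pending -= 1
            last_local_launch = now
        # Other executors get tasks only after the whole wait has expired.
        if now - last_local_launch >= total_wait_secs:
            for e in range(1, num_executors):
                if pending > 0 and free_at[e] <= now:
                    free_at[e] = now + task_secs
                    launched[e] += 1
                    pending -= 1
        # Jump to the next time anything can change:
        # an executor becomes idle, or the locality timer expires.
        candidates = [t for t in free_at if t > now]
        expiry = last_local_launch + total_wait_secs
        if expiry > now:
            candidates.append(expiry)
        now = min(candidates)
    return max(free_at), launched


# 100 executors, 500 tasks, 6s total wait (3s process + 3s host).
short_makespan, short_counts = simulate(100, 500, task_secs=5, total_wait_secs=6)
long_makespan, long_counts = simulate(100, 500, task_secs=7, total_wait_secs=6)
print(short_makespan, short_counts[0])
print(long_makespan, long_counts[0])
```

In this toy model the 5-second tasks all land on executor 0, so the run takes 500 x 5 = 2500 seconds, whereas the 7-second tasks outlast the wait and spread across the cluster, finishing far sooner.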
WORKAROUNDS:
(1) You can change the locality wait times so that they are shorter than the task execution time. You need to take into account the sum of all wait times to use all the resources on your cluster. For example, if you have resources on different racks, this will include the sum of "spark.locality.wait.process" + "spark.locality.wait.node" + "spark.locality.wait.rack". Those each default to "3s". The simplest way is to set "spark.locality.wait.process" to your desired wait interval, and set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0". For example, if your tasks take ~3 seconds on average, you might set "spark.locality.wait.process" to "1s". NOTE: due to SPARK-18967, avoid setting spark.locality.wait=0; instead, use spark.locality.wait=1ms.
Note that this workaround isn't perfect: with less delay scheduling, you may not get as good resource locality. After this issue is fixed, you'd most likely want to undo these configuration changes.
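As a rough illustration of workaround (1), the sketch below builds the three settings named above, sums them to get the total delay before any executor can be used, and renders them as `--conf key=value` arguments for spark-submit. The helper names (`to_millis`, `locality_conf`, `total_wait_ms`, `as_submit_flags`) are hypothetical, and the parser deliberately handles only the "0", "ms" and "s" forms that appear in this issue.

```python
import re

_UNIT_TO_MS = {"ms": 1, "s": 1000}


def to_millis(value):
    """Parse only the duration forms used in this issue: "0", "1ms", "3s"."""
    if value == "0":
        return 0
    match = re.fullmatch(r"(\d+)(ms|s)", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_TO_MS[unit]


def locality_conf(process_wait, node_wait="0", rack_wait="0"):
    """Settings for workaround (1): one wait at process level, none at node/rack."""
    return {
        "spark.locality.wait.process": process_wait,
        "spark.locality.wait.node": node_wait,
        "spark.locality.wait.rack": rack_wait,
    }


def total_wait_ms(conf):
    """Total delay before tasks may run on any executor: the sum of all levels."""
    return sum(to_millis(v) for v in conf.values())


def as_submit_flags(conf):
    """Render the settings as spark-submit style --conf arguments."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


defaults = locality_conf("3s", "3s", "3s")   # each level defaults to 3s
workaround = locality_conf("1s")             # tasks take ~3s, so wait 1s in total
print(total_wait_ms(defaults), total_wait_ms(workaround))
print(" ".join(as_submit_flags(workaround)))
```

The point of summing the levels is that the total, not any single level, has to stay below the typical task duration for the rest of the cluster to be used.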
(2) The worst case here will only happen if your tasks have extreme skew in their locality preferences. Users may be able to modify their job to control the distribution of the original input data.
(2a) A shuffle may end up with very skewed locality preferences, especially if you do a repartition starting from a small number of partitions. (Shuffle locality preference is assigned if any node has more than 20% of the shuffle input data; by chance, you may have one node just above that threshold, and all other nodes just below it.) In this case, you can turn off locality preference for shuffle data by setting spark.shuffle.reduceLocality.enabled=false.
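To make the threshold effect in (2a) concrete, here is a minimal sketch of the 20% rule as described above. It is an illustration only, not Spark's implementation; the function name `preferred_nodes` and the node names and byte counts are invented for the example.

```python
def preferred_nodes(bytes_by_node, threshold=0.2):
    """Return nodes holding more than `threshold` of the shuffle input (toy model)."""
    total = sum(bytes_by_node.values())
    if total == 0:
        return []
    return [node for node, size in bytes_by_node.items() if size / total > threshold]


# Nearly even data: one node sits just above 20%, the other four just below it.
shuffle_input = {
    "node-a": 2100,
    "node-b": 1975,
    "node-c": 1975,
    "node-d": 1975,
    "node-e": 1975,
}
print(preferred_nodes(shuffle_input))
```

Even though the data is spread almost evenly, only one node clears the threshold, so it would be the sole locality preference. That is the skew that setting spark.shuffle.reduceLocality.enabled=false avoids.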
Issue Links
- is duplicated by: SPARK-11460 Locality waits should be based on task set creation time, not last launch time (Resolved)
- is related to: SPARK-27214 Upgrading locality level when lots of pending tasks have been waiting more than locality.wait (In Progress)
- links to