Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- 2.0.0
- None
Description
The YarnAllocator doesn't properly track containers that are being launched but are not yet running. If launching the containers on the NM takes time, they don't show up in numExecutorsRunning, but they have already been removed from the pending list. If allocateResources is called again in that window, it can conclude that executors are missing when they are not (they just haven't been launched yet).
This was introduced by SPARK-12447.
Where it checks for missing executors:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L297
It only updates numExecutorsRunning after the NM has started the container:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala#L524
Thus, if the NM or the network is slow, the allocator can miscount and start additional executors.
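To make the miscount concrete, here is a minimal toy model of the accounting described above. It is a sketch, not Spark's YarnAllocator: the class `ToyAllocator`, its counters, and the `trackStarting` flag are all hypothetical names invented for illustration, and counting in-flight launches is just one possible way to close the gap, not necessarily what the actual fix does.

```scala
// Toy model only: every name here is hypothetical, none of this is Spark code.
final class ToyAllocator(target: Int, trackStarting: Boolean) {
  private var pending = 0   // container requests sent to the RM, not yet granted
  private var starting = 0  // containers granted, launch on the NM still in flight
  private var running = 0   // executors the NM has confirmed as started
  private var requested = 0 // total number of containers ever requested

  /** Requests whatever the allocator currently believes is missing. */
  def allocateResources(): Int = {
    // Assumption: with trackStarting = false, in-flight launches are invisible,
    // which mirrors the gap described in this issue.
    val accounted = pending + running + (if (trackStarting) starting else 0)
    val missing = math.max(target - accounted, 0)
    pending += missing
    requested += missing
    missing
  }

  /** The RM grants a container: it leaves the pending list and the launch begins. */
  def containerGranted(): Unit = {
    require(pending > 0, "no pending request to satisfy")
    pending -= 1
    starting += 1
  }

  /** The NM reports that the executor has started. */
  def launchCompleted(): Unit = {
    require(starting > 0, "no launch in flight")
    starting -= 1
    running += 1
  }

  def totalRequested: Int = requested
}

object AllocatorDemo {
  /** Two executors wanted; the second allocate call arrives before the NM finishes launching. */
  def totalRequestedWithSlowLaunch(trackStarting: Boolean): Int = {
    val allocator = new ToyAllocator(target = 2, trackStarting = trackStarting)
    allocator.allocateResources() // first pass: requests 2 containers
    allocator.containerGranted()  // both containers are granted ...
    allocator.containerGranted()  // ... but neither launch has completed yet
    allocator.allocateResources() // second pass, during the slow launch
    allocator.totalRequested
  }
}
```

With `trackStarting = false`, `AllocatorDemo.totalRequestedWithSlowLaunch` returns 4 for a target of 2, because the two granted containers are in neither the pending nor the running count during the second pass. With `trackStarting = true` it returns 2.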
Issue Links
- is broken by
  - SPARK-12447 Only update AM's internal state when executor is successfully launched by NM (Resolved)
- is duplicated by
  - SPARK-21562 Spark may request extra containers if the rpc between YARN and spark is too fast (Resolved)
- links to