This is not a new issue, but async scheduling makes it worse:
In sync scheduling, an AM container is allocated to a node only when that node has just sent a heartbeat to the RM, so the node can be assumed alive when the AM launcher connects to the NM to launch the container. It is still possible for the NM to crash right after the heartbeat, which causes the AM to hang for a while, but this is relatively rare.
When async scheduling is enabled:
- Skip nodes that have missed X node heartbeats.
In addition, it is better to reduce the wait time by setting the following configs, so that a container being launched at an NM with connectivity issues fails earlier.
The second part is not covered by the patch.
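
For illustration only, the first part (skipping nodes that have missed X heartbeats) could look roughly like the sketch below. This is not the code from the patch: the class `HeartbeatFilter`, its methods, and all parameters are hypothetical names, and it assumes the scheduler can read each node's last heartbeat timestamp and the configured heartbeat interval. The value of X is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the patch or of the YARN API. */
final class HeartbeatFilter {
    private HeartbeatFilter() {}

    /**
     * Returns true if the node has been silent for longer than
     * maxMissedHeartbeats heartbeat intervals (the "X" above).
     */
    static boolean shouldSkip(long lastHeartbeatMs, long nowMs,
                              long heartbeatIntervalMs, int maxMissedHeartbeats) {
        long silentMs = nowMs - lastHeartbeatMs;
        return silentMs > heartbeatIntervalMs * maxMissedHeartbeats;
    }

    /** Keeps only the nodes an async scheduling thread should still consider. */
    static List<String> schedulableNodes(Map<String, Long> lastHeartbeatByNode, long nowMs,
                                         long heartbeatIntervalMs, int maxMissedHeartbeats) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeatByNode.entrySet()) {
            if (!shouldSkip(entry.getValue(), nowMs, heartbeatIntervalMs, maxMissedHeartbeats)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

The check is based on elapsed time rather than on a missed-heartbeat counter, because in async scheduling the allocation is not triggered by a heartbeat, so there is no per-heartbeat hook at allocation time to count from.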