Details
Description
When the ATS1/1.5 daemon is not running, RM recovery is delayed until the timeline client times out for each application. By default, each timeout takes around 5 minutes. With many completed applications, the RM waits for roughly (number of completed applications in the cluster * 5 minutes), so it appears to be hung.
The primary cause of this behavior is YARN-3044 and YARN-4129, which refactored the existing system metric publisher. The refactoring made the appFinished event synchronous; it was previously asynchronous.
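To illustrate why a synchronous appFinished publish stalls recovery while an asynchronous one does not, here is a minimal sketch. It does not use the Hadoop API: `RecoveryDelaySketch`, `publishAppFinished`, and the other names are hypothetical, the timeline client timeout is simulated with a sleep, and the timeout is scaled down from minutes to milliseconds so the example finishes quickly.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RecoveryDelaySketch {

    // Hypothetical stand-in for a timeline client call that blocks until
    // it times out because the ATS daemon is down (assumption: simulated
    // with a sleep, not a real network call).
    static void publishAppFinished(long timeoutMillis) {
        try {
            Thread.sleep(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Synchronous publishing: the recovery thread waits out one timeout
    // per completed application before it can continue.
    static long recoverSynchronously(int completedApps, long timeoutMillis) {
        long start = System.nanoTime();
        for (int i = 0; i < completedApps; i++) {
            publishAppFinished(timeoutMillis);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    // Asynchronous publishing: events are handed to a separate dispatcher
    // thread, so the recovery thread only pays the cost of enqueueing them.
    static long recoverAsynchronously(int completedApps, long timeoutMillis)
            throws InterruptedException {
        ExecutorService dispatcher = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();
        for (int i = 0; i < completedApps; i++) {
            dispatcher.submit(() -> publishAppFinished(timeoutMillis));
        }
        // Time spent by the recovery thread itself ends here.
        long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        // Drain the dispatcher so the sketch exits cleanly.
        dispatcher.shutdown();
        dispatcher.awaitTermination(1, TimeUnit.MINUTES);
        return elapsed;
    }

    // Worst-case wait from the description: completed apps * timeout.
    static long worstCaseDelayMinutes(int completedApps, long timeoutMinutes) {
        return completedApps * timeoutMinutes;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("sync recovery blocked for ms: "
                + recoverSynchronously(5, 20));
        System.out.println("async recovery blocked for ms: "
                + recoverAsynchronously(5, 20));
        System.out.println("worst case for 100 apps at 5 min each, in minutes: "
                + worstCaseDelayMinutes(100, 5));
    }
}
```

In the synchronous variant the blocked time grows linearly with the number of completed applications, which matches the delay described above. In the asynchronous variant the recovery thread returns almost immediately and the timeouts are absorbed by the dispatcher thread.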
Attachments
Issue Links
- is broken by
  - YARN-4129 Refactor the SystemMetricPublisher in RM to better support newer events (Resolved)