It is observed that if the ATS1/1.5 daemon is not running, RM recovery is delayed until the timeline client times out for each application. By default, each timeout takes around 5 minutes. When there are many completed applications, the total time the RM waits is roughly (number of completed applications in the cluster * 5 minutes), so the RM appears to hang.
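To make the scale of the delay concrete, here is a rough back-of-the-envelope sketch of the estimate above. The class and method names are hypothetical and are not part of YARN; the 5-minute figure is the approximate default timeout mentioned in this report.

```java
import java.time.Duration;

// Hypothetical helper, not part of YARN: estimates the worst-case recovery delay
// when every completed application waits for one full timeline client timeout.
public class RecoveryDelayEstimate {

    // Total delay = number of completed applications * per-application timeout.
    static Duration estimateDelay(int completedApps, Duration perAppTimeout) {
        return perAppTimeout.multipliedBy(completedApps);
    }

    public static void main(String[] args) {
        // Assumed example value: 100 completed applications in the cluster.
        Duration delay = estimateDelay(100, Duration.ofMinutes(5));
        System.out.println("Estimated recovery delay: " + delay.toMinutes() + " minutes");
    }
}
```

With 100 completed applications, this works out to 500 minutes, i.e. more than 8 hours of recovery time.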
The primary reason for this behavior is YARN-3044 and YARN-4129, which refactored the existing system metrics publisher. This refactoring made the appFinished event synchronous; it was asynchronous earlier.
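The difference between the two modes can be illustrated with a simplified sketch. This is not the actual SystemMetricsPublisher code; `slowPublish` is a hypothetical stand-in for a timeline client call that blocks until it times out, and the millisecond values are scaled down for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Simplified illustration only; none of these names exist in YARN.
public class PublishModes {

    // Hypothetical stand-in for a publish call that blocks until the client times out.
    static void slowPublish(long blockMillis) {
        try {
            Thread.sleep(blockMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Synchronous publishing: the recovery thread waits for every publish in turn,
    // so the elapsed time grows with apps * blockMillis.
    static long recoverSync(int apps, long blockMillis) {
        long start = System.nanoTime();
        for (int i = 0; i < apps; i++) {
            slowPublish(blockMillis);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    // Asynchronous publishing: the recovery thread only enqueues the events and
    // returns; a background thread absorbs the timeouts.
    static long recoverAsync(int apps, long blockMillis, ExecutorService publisher) {
        long start = System.nanoTime();
        for (int i = 0; i < apps; i++) {
            publisher.submit(() -> slowPublish(blockMillis));
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        ExecutorService publisher = Executors.newSingleThreadExecutor();
        System.out.println("sync recovery took ms:  " + recoverSync(5, 50));
        System.out.println("async recovery took ms: " + recoverAsync(5, 50, publisher));
        publisher.shutdownNow();
    }
}
```

In the synchronous case the caller pays for every timeout itself, which matches the delayed recovery described above; in the asynchronous case the caller returns almost immediately.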
- is broken by YARN-4129: Refactor the SystemMetricPublisher in RM to better support newer events