Status: Patch Available
Affects Version/s: 3.1.3
Fix Version/s: None
This issue introduces a way to lazily initialize appLogAggregatorImpl, so that it accesses HDFS as late as possible (usually when the app finishes). This avoids having every NM access HDFS at the same time when the NMs in a cluster are restarted, and reduces the pressure on HDFS. Details follow.
In the current version, the app log aggregator checks HDFS and tries to create the app's log directory when an app is initialized. This causes a problem when restarting NMs in a large cluster with a heavily loaded HDFS. A restarting NM initializes all of its apps, and each initialization tries to connect to HDFS. If HDFS is already heavily loaded, many NMs restarting at the same time can make HDFS stop responding. Each NM then waits for HDFS's response, the RM cannot get the NM's heartbeat, and the RM treats all of its containers as timed out.
In our production environment with 3500+ NMs, we found that restarting the NMs puts heavy pressure on HDFS and that app initialization blocks on accessing HDFS (stack attached below), which causes all containers to fail (the container count on an NM drops to zero).
We solve this problem by introducing lazy initialization in appLogAggregatorImpl. When an app is initialized, we just create the appLogAggregatorImpl object without calling verifyAndCreateRemoteLogDir(). We call verifyAndCreateRemoteLogDir() when the app starts aggregating logs. Because apps do not all finish or aggregate logs at the same time, the verifyAndCreateRemoteLogDir() calls are spread out over time, which means the NMs will not all access HDFS at once even when they restart at the same time.
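The deferral described above can be illustrated with a minimal, self-contained sketch. It does not use the Hadoop classes: RemoteLogDirChecker, LazyAppLogAggregator, CountingChecker and LazyInitDemo are hypothetical names made up for illustration, and the real appLogAggregatorImpl is considerably more involved. The sketch only shows that the constructor makes no remote call and that the check runs once, on first aggregation.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the remote check; not the real Hadoop API. */
interface RemoteLogDirChecker {
    void verifyAndCreateRemoteLogDir();
}

/** Simplified sketch of an aggregator that defers the remote check. */
class LazyAppLogAggregator {
    private final String appId;
    private final RemoteLogDirChecker checker;
    private boolean remoteDirReady = false;

    LazyAppLogAggregator(String appId, RemoteLogDirChecker checker) {
        // No remote call here, so creating the object at app init is cheap.
        this.appId = appId;
        this.checker = checker;
    }

    /** Runs the remote check at most once, on first use. */
    private synchronized void ensureRemoteDir() {
        if (!remoteDirReady) {
            checker.verifyAndCreateRemoteLogDir();
            // Only marked ready if the check did not throw, so a failed
            // check is retried on the next aggregation attempt.
            remoteDirReady = true;
        }
    }

    String aggregateLogs() {
        ensureRemoteDir();
        return "aggregated " + appId;
    }
}

/** Test double that counts how often the remote check is invoked. */
class CountingChecker implements RemoteLogDirChecker {
    final AtomicInteger calls = new AtomicInteger();

    @Override
    public void verifyAndCreateRemoteLogDir() {
        calls.incrementAndGet();
    }
}

class LazyInitDemo {
    public static void main(String[] args) {
        CountingChecker checker = new CountingChecker();
        LazyAppLogAggregator agg = new LazyAppLogAggregator("app_1", checker);
        System.out.println(checker.calls.get()); // prints 0: nothing touched at init
        agg.aggregateLogs();
        agg.aggregateLogs();
        System.out.println(checker.calls.get()); // prints 1: checked once, on first use
    }
}
```

In this sketch the cost of the remote check moves from object creation to the first aggregateLogs() call, which is the property the proposal relies on to spread HDFS access over time.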
YARN-8418 solved the leak of container log directories by adding a way to update the NM's credentials. With lazy initialization of appLogAggregatorImpl, we no longer need YARN-8418's logic, because the lazy initialization happens after the addCredentials logic, which means the credentials are always refreshed before we use them.
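The ordering argument can be sketched in the same spirit. This is an illustration under stated assumptions, not the NM's actual credential handling: AppCredentials, DeferredDirInit and CredentialOrderDemo are hypothetical classes, and a plain string stands in for the delegation tokens. The point shown is only that a value read at first use reflects any refresh applied after construction.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token holder; stands in for an app's credentials on the NM. */
class AppCredentials {
    private String token;

    AppCredentials(String token) {
        this.token = token;
    }

    synchronized void addCredentials(String refreshedToken) {
        this.token = refreshedToken;
    }

    synchronized String currentToken() {
        return token;
    }
}

/** Hypothetical component that reads the token only when aggregation starts. */
class DeferredDirInit {
    private final AppCredentials credentials;
    private final List<String> tokensUsed = new ArrayList<>();
    private boolean initialized = false;

    DeferredDirInit(AppCredentials credentials) {
        // The token is not read here; only the holder is remembered.
        this.credentials = credentials;
    }

    synchronized void startAggregation() {
        if (!initialized) {
            // The token is read at first use, after any earlier refresh.
            tokensUsed.add(credentials.currentToken());
            initialized = true;
        }
    }

    synchronized List<String> tokensUsed() {
        return new ArrayList<>(tokensUsed);
    }
}

class CredentialOrderDemo {
    public static void main(String[] args) {
        AppCredentials creds = new AppCredentials("stale-token");
        DeferredDirInit init = new DeferredDirInit(creds);
        creds.addCredentials("fresh-token"); // refresh arrives after object creation
        init.startAggregation();
        System.out.println(init.tokensUsed()); // prints [fresh-token]
    }
}
```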
In summary, this issue does two things:
- Introducing lazy initialization in appLogAggregatorImpl to avoid concentrated access to HDFS when all NMs in a cluster are restarted.
- Removing the need for the YARN-8418 logic, because lazy initialization guarantees that the credentials are refreshed before they are used.