While reviewing the code for the log aggregation feature, we found that when the log aggregator is initialized for each app, it verifies and creates the remote dir, and in doing so makes an additional setPermission() call even when the remote dir already exists with the expected permissions.
This code path was introduced for cloud storage, where the additional check is needed to confirm that the remote file system and the underlying cloud storage support setting permissions.
The upstream JIRA that introduced this call is YARN-9030.
This additional setPermission() call for every app/job floods the HDFS NameNode (NN) and its RPC queue, which degrades overall performance.
The ask here is to see whether either of the following is feasible:
(a) Put the code introduced via YARN-9030 behind a configuration option, possibly set to false by default (assuming the storage used is HDFS) so that this code is bypassed.
(b) Check in the code whether the customer is using HDFS storage (by inspecting yarn.nodemanager.remote-app-log-dir) and bypass this code if the storage is indeed HDFS, given that the code introduced in YARN-9030 is mainly intended for cloud storage providers.
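
A rough sketch of how options (a) and (b) might be combined into a single decision is shown below. It is not the actual YARN code: the class, the method names, and the property yarn.log-aggregation.remote-dir.permission-check.enabled are hypothetical and exist only for illustration, and a plain Map stands in for the Hadoop Configuration object so that the example is self-contained. The only real setting referenced is yarn.nodemanager.remote-app-log-dir.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class SetPermissionGate {
    // HYPOTHETICAL property name for option (a); not an existing YARN setting.
    static final String CHECK_ENABLED_KEY =
        "yarn.log-aggregation.remote-dir.permission-check.enabled";
    // Real YARN property that holds the remote log directory.
    static final String REMOTE_DIR_KEY = "yarn.nodemanager.remote-app-log-dir";

    // Option (b): treat the storage as HDFS when the remote dir's scheme is
    // "hdfs", falling back to the default file system scheme for bare paths.
    static boolean isHdfs(String remoteDir, String defaultFsScheme) {
        String scheme = URI.create(remoteDir).getScheme();
        if (scheme == null) {
            scheme = defaultFsScheme;
        }
        return "hdfs".equalsIgnoreCase(scheme);
    }

    // Returns true when the extra setPermission() call should still be made.
    static boolean shouldCallSetPermission(Map<String, String> conf,
                                           String defaultFsScheme) {
        // Option (a): an explicit setting always wins.
        String explicit = conf.get(CHECK_ENABLED_KEY);
        if (explicit != null) {
            return Boolean.parseBoolean(explicit);
        }
        // Option (b): otherwise skip the call only when the storage is HDFS.
        String remoteDir = conf.get(REMOTE_DIR_KEY);
        if (remoteDir == null) {
            return true; // unknown storage: keep the existing behaviour
        }
        return !isHdfs(remoteDir, defaultFsScheme);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(REMOTE_DIR_KEY, "hdfs://nn:8020/app-logs");
        System.out.println("HDFS remote dir: "
            + shouldCallSetPermission(conf, "hdfs"));
        conf.put(REMOTE_DIR_KEY, "s3a://bucket/app-logs");
        System.out.println("Cloud remote dir: "
            + shouldCallSetPermission(conf, "hdfs"));
    }
}
```

In this sketch the explicit option takes precedence over scheme detection, so an operator could force the check on or off regardless of the storage type. Whether that precedence is desirable is a design question for the actual fix.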