Details
Type: Improvement
Status: Patch Available
Priority: Major
Resolution: Unresolved
Description
At present, when startContainers() is called and the NM does not yet contain the application, it enters the INIT_APPLICATION step. During application init, createAppDir() is executed, and it is a blocking operation.
createAppDir() has to interact with an external file system, so its latency depends on the SLA of that file system. When the external file system has high latency, the NM dispatcher thread of ContainerManagerImpl gets stuck. (In fact, I have seen a case where the NM was stuck here for more than an hour.)
I think it would be more reasonable to defer createAppDir() until the logs are actually uploaded (which happens in other threads). In addition, depending on the logRetentionPolicy, many containers may never reach this step, which would save a lot of interactions with the external file system.
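To illustrate the idea, here is a minimal sketch of deferring the directory creation to upload time. It is not the NodeManager code: LazyAppDirSketch, RemoteFs, AppLogAggregator, init() and uploadLogs() are hypothetical names used only for this example, and the boolean argument stands in for whatever the logRetentionPolicy would decide.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class LazyAppDirSketch {

    /** Hypothetical stand-in for the external file system. */
    interface RemoteFs {
        void createAppDir(String appId) throws IOException;
    }

    /** Hypothetical per-application aggregator that creates the app dir lazily. */
    static class AppLogAggregator {
        private final String appId;
        private final RemoteFs fs;
        private boolean appDirCreated = false;

        AppLogAggregator(String appId, RemoteFs fs) {
            this.appId = appId;
            this.fs = fs;
        }

        /** Runs on the dispatcher thread: deliberately performs no remote I/O. */
        void init() {
            // Only local bookkeeping would go here.
        }

        /**
         * Runs on an upload thread. The remote directory is created on the
         * first upload that actually happens; skipped uploads never touch
         * the external file system.
         */
        synchronized void uploadLogs(boolean shouldUpload) throws IOException {
            if (!shouldUpload) {
                return; // assumed outcome of the retention policy check
            }
            if (!appDirCreated) {
                fs.createAppDir(appId); // may be slow, but not on the dispatcher
                appDirCreated = true;   // set only after the call succeeded
            }
            // The actual log upload would follow here.
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger remoteCalls = new AtomicInteger();
        RemoteFs fs = appId -> remoteCalls.incrementAndGet();
        AppLogAggregator aggregator = new AppLogAggregator("application_0001", fs);

        aggregator.init(); // dispatcher thread returns immediately

        ExecutorService uploader = Executors.newSingleThreadExecutor();
        try {
            // Returning null makes each lambda a Callable, so IOException is allowed.
            uploader.submit(() -> { aggregator.uploadLogs(false); return null; }).get();
            uploader.submit(() -> { aggregator.uploadLogs(true); return null; }).get();
            uploader.submit(() -> { aggregator.uploadLogs(true); return null; }).get();
        } finally {
            uploader.shutdown();
        }
        System.out.println("remote createAppDir calls: " + remoteCalls.get());
    }
}
```

With this shape, a slow external file system delays only the upload thread, and applications whose logs are never uploaded cost no remote calls at all.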