Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- 0.23.0, 2.0.0-alpha
- None
Description
When a container launch request is sent to the NodeManager (NM), any exception thrown while initializing log aggregation brings the NM down. The problem can be triggered by many conditions, including but not limited to: transient RPC connection issues, missing tokens, expired tokens, permission errors, and a full or quota-exceeded DFS. It can occur both with and without security enabled.
As a result, an entire cluster can be brought down rather easily, whether maliciously, accidentally, or through a submission bug.
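The failure mode described above can be illustrated with a minimal sketch. It is not the NodeManager's actual code and not necessarily how the fix was implemented. The names `LogAggregationInitSketch`, `LogAggregationInitializer`, `LaunchOutcome`, and `handleLaunch` are hypothetical, introduced only for illustration. The idea shown is to contain an exception from log aggregation init so that it fails only the affected application instead of propagating up and stopping the daemon.

```java
import java.io.IOException;

public class LogAggregationInitSketch {

    /** Hypothetical stand-in for the per-application log aggregation setup. */
    interface LogAggregationInitializer {
        void init(String appId) throws IOException;
    }

    /** Hypothetical outcome of handling one container launch request. */
    enum LaunchOutcome { STARTED, APP_FAILED }

    /**
     * Handles a launch request. Any exception from log aggregation init is
     * caught here, so it fails this one application rather than escaping
     * to the caller (which, in the bug described, took the whole NM down).
     */
    static LaunchOutcome handleLaunch(String appId, LogAggregationInitializer initializer) {
        try {
            initializer.init(appId);
            return LaunchOutcome.STARTED;
        } catch (IOException | RuntimeException e) {
            // Assumption: reporting the failure per application is acceptable.
            System.err.println("Log aggregation init failed for " + appId + ": " + e.getMessage());
            return LaunchOutcome.APP_FAILED;
        }
    }

    public static void main(String[] args) {
        // A healthy init, followed by two simulated failures.
        System.out.println(handleLaunch("app_1", appId -> { }));
        System.out.println(handleLaunch("app_2", appId -> {
            throw new IOException("DFS quota exceeded");
        }));
        System.out.println(handleLaunch("app_3", appId -> {
            throw new IllegalStateException("missing token");
        }));
    }
}
```

In this sketch the process keeps serving later requests after a failed init, which is the property the report says the NM lacked.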