YARN-3518 is a separate concern with different ramifications. We should discuss it there and not mix these two.
Set this to a bigger value, maybe based on network partition considerations, not only for NM restart.
What value do you propose? As pointed out earlier, anything over 10 minutes is pointless since the container allocation expires in that time. Is it common for network partitions to take longer than 3 minutes but less than 10 minutes? If so we should tune the value for that. If not then making the value larger just slows recovery time.
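To make the trade-off concrete, here is a toy model of the reasoning above. It is only a sketch and not YARN code: the function names and the `minutes` helper are hypothetical, and the only figures taken from this thread are the 10-minute container allocation expiry and the proposed 3-minute wait.

```python
# Toy model of the trade-off discussed above; this is not YARN code.
# All names here are hypothetical and exist only for illustration.

def minutes(n):
    """Convert minutes to milliseconds."""
    return n * 60 * 1000


# 10-minute container allocation expiry, as cited in this thread.
ALLOCATION_EXPIRY_MS = minutes(10)


def useful_wait_ms(max_wait_ms, allocation_expiry_ms=ALLOCATION_EXPIRY_MS):
    """Waiting past the allocation expiry gains nothing, so cap the wait there."""
    return min(max_wait_ms, allocation_expiry_ms)


def launch_survives_partition(partition_ms, max_wait_ms,
                              allocation_expiry_ms=ALLOCATION_EXPIRY_MS):
    """Return True if the client is still retrying when the partition heals."""
    return partition_ms < useful_wait_ms(max_wait_ms, allocation_expiry_ms)


if __name__ == "__main__":
    # A 5-minute partition with a 3-minute wait versus an 8-minute wait.
    print(launch_survives_partition(minutes(5), minutes(3)))
    print(launch_survives_partition(minutes(5), minutes(8)))
```

Under this model, raising the wait above 3 minutes only helps for partitions that last between 3 and 10 minutes, and any value above 10 minutes behaves the same as 10.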
3 mins seems dangerous. If the RM fails over and recovery takes several minutes, the NM may kill all containers, which is not expected in a production environment.
This JIRA configures the amount of time NM clients (primarily ApplicationMasters, and the RM when launching ApplicationMasters) will try to connect to a particular NM before failing. I'm missing how RM failover leads to a mass killing of containers due to this proposed change. This is not a property used by the NM, so the NM is not going to start killing containers differently based on an updated value for it. The only case where the RM uses this property is when connecting to NMs to launch AM containers, and it only does so for NMs that have recently heartbeated. Could you explain how this leads to all containers getting killed on a particular node?