Thanks, Ming Ma, for replying to the comments.
Yes, the approach taken in YARN-4131 is simpler because it leverages the existing protocol to accomplish the kill container scenario. But changing the NM-RM protocol will allow us to support other useful scenarios besides kill container and thread dump.
Agreed. I don't mean that the previous approach (YARN-4131) can replace the approach here. I just want to understand whether the approach here can cover all the cases that YARN-4131 tries to address. It sounds like we will still need YARN-4131's approach even after the patch here goes in. Please see the comments below for details.
Kill container via preemption. This means the RM will know about it before the NM does, which differs from the signal container order, where the container is killed first without the RM's knowledge. It seems that killing a container without the RM's knowledge matches the container crash test case better, while killing a container via preemption can simulate preemption. But does it matter here as long as the container is killed?
Yes, it does matter. Preempted containers are not counted as container failures from the AM's perspective and do not affect whether the application's final result is a success. In some tests, we need to emulate both cases instead of just one.
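To illustrate the difference, here is a minimal sketch of how an AM might tally failures from container exit statuses. It is not taken from any patch in this thread. The class `ExitStatusSketch` and its methods are hypothetical, the `PREEMPTED` value is assumed to mirror `ContainerExitStatus.PREEMPTED` (-102), and the failure policy is deliberately simplified; a real AM may exclude more statuses.

```java
// Hypothetical, self-contained sketch; no Hadoop dependency is required.
final class ExitStatusSketch {
    // Assumed to mirror org.apache.hadoop.yarn.api.records.ContainerExitStatus.
    static final int SUCCESS = 0;
    static final int PREEMPTED = -102;
    // Typical exit code of a process killed by SIGKILL (128 + 9).
    static final int KILLED_BY_SIGKILL = 137;

    // Simplified policy: anything other than success or preemption is a failure.
    static boolean countsAsFailure(int exitStatus) {
        return exitStatus != SUCCESS && exitStatus != PREEMPTED;
    }

    // Count how many completed containers the AM would treat as failed.
    static int countFailures(int[] exitStatuses) {
        int failures = 0;
        for (int status : exitStatuses) {
            if (countsAsFailure(status)) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int[] statuses = {SUCCESS, PREEMPTED, KILLED_BY_SIGKILL};
        System.out.println("failures = " + countFailures(statuses));
    }
}
```

Under this policy, a container killed through preemption leaves the failure count unchanged, while one killed by a signal without the RM's knowledge increments it, which is why a test may need both paths.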
Container Expiration. Is that only for a container that has been allocated/acquired but is not yet in the running state? It seems it is used by the RM to time out container allocation/acquisition. It will trigger RMContainerEventType.EXPIRE and won't have any impact on a running container.
Sorry, I meant the container LOST situation. Suppose we want to emulate the case where the NM is shut down suddenly (kill -9) and never comes back, and see its impact on RMContainers. We may not be able to achieve this through the NM-RM protocol; would it be better to generate a timeout event directly from the RM?
My overall thinking is that there are two kinds of sources that affect containers' state (from the RM's standpoint). The first is state update events triggered by the container/NM, which include the mainstream cases of the container lifecycle and are well addressed by the approach here. The second is events generated within the RM itself, such as resource/container preemption or losing contact with an NM that has running containers. I would prefer that YARN-4131 address the second kind of source as an addendum to our approach here. What do you think?
BTW, it looks like the test failure in TestContainerManager.testForcefulShutdownSignal is related?