[SLIDER-438] Slider agent continues to run in the container on a node where NM dies - ASF JIRA

Steps to reproduce:

Set up a 3-node cluster (in non-HA mode)
Run slider create for the HBase app-package (with only the HMaster and HRegionServer components, to keep things simple)
Assume that the HRegionServer came up on a node different from that of the HMaster and Slider AM (if not, repeating destroy and create a couple of times will get you to this setup)
Kill the NM on the node where the HRegionServer is running
Restart the NM within 10 minutes (the default interval after which the RM marks the node as LOST, configurable via yarn.nm.liveness-monitor.expiry-interval-ms)
At this point the Slider AM received the container-lost event from the RM, marked the container as lost, and requested a new one from the RM. A new HRegionServer container came up (on the same host where the old one was running). Both HRegionServer containers then continued to run alongside each other, each heart-beating successfully to the AM.

Expected:

Given that the first HRegionServer instance was still heart-beating to the AM, the AM should be able to send a kill signal and bring that agent/container down.
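
The expected behaviour could be sketched roughly as follows. This is only an illustration, not Slider's actual code: the class AgentHeartbeatTracker, its methods, and the Reply enum are all hypothetical names. The idea is that the AM keeps a set of container IDs it still considers live, drops an ID when the RM reports the container lost, and answers any later heartbeat from that ID with a terminate reply.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: these names are assumptions, not Slider's real classes.
class AgentHeartbeatTracker {
    // Reply the AM would piggyback on the heartbeat response.
    enum Reply { CONTINUE, TERMINATE }

    // IDs of containers the AM currently considers live.
    private final Set<String> liveContainers = ConcurrentHashMap.newKeySet();

    // Called when the RM hands the AM a new container.
    void onContainerAllocated(String containerId) {
        liveContainers.add(containerId);
    }

    // Called when the RM reports the container as lost or completed.
    void onContainerLost(String containerId) {
        liveContainers.remove(containerId);
    }

    // Called for every agent heartbeat; an agent whose container is no
    // longer tracked as live is told to shut itself down.
    Reply onHeartbeat(String containerId) {
        return liveContainers.contains(containerId) ? Reply.CONTINUE : Reply.TERMINATE;
    }

    public static void main(String[] args) {
        AgentHeartbeatTracker tracker = new AgentHeartbeatTracker();
        tracker.onContainerAllocated("container_01"); // first HRegionServer
        tracker.onContainerLost("container_01");      // NM died, RM reports loss
        tracker.onContainerAllocated("container_02"); // replacement HRegionServer

        // The old agent is still heart-beating after the NM restart.
        System.out.println("container_01 -> " + tracker.onHeartbeat("container_01"));
        System.out.println("container_02 -> " + tracker.onHeartbeat("container_02"));
    }
}
```

With this scheme the stale agent from the steps above would receive TERMINATE on its next heartbeat, while the replacement container keeps receiving CONTINUE.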

is duplicated by

SLIDER-428 When AppMaster received container release notification it should ask agents to go down if they are still heart-beating in