We can do the cleanup (i.e. stop the active services) when we switch to standby; we do this already. Cleanup is also done when we stop the RM. So this shouldn't be an issue.
What is happening is as follows. Let us assume there are two RMs, RM1 and RM2.
Basically, when the exception occurs, RM1 waits for RM2 to become active and rejoins leader election. Since both RMs have the wrong configuration, RM1 will try to become active again (rather than switch to standby) after RM2 has tried and failed the same way.
Now, because the problem is in the call to refreshAll, both RMs end up marked as ACTIVE in their respective RMContexts: we set the state to ACTIVE before calling refreshAll.
The problem reported here is that the RM is shown as Active when it is not actually active: the UI is accessible, and getServiceState returns Active for both RMs. When we serve the UI or answer getServiceState, we check the state in the RMContext, and that state is ACTIVE.
So for anyone accessing the RM from the command line or via the UI, the RM appears active (because the RMContext says so) when it is not really active. Both RMs are just trying incessantly to become active and failing.
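A minimal sketch of the failure sequence described above (class and method names here are illustrative, not the actual YARN source): the context state is flipped to ACTIVE before the refresh runs, so anyone who reads the context afterwards sees ACTIVE even though the refresh threw.

```java
// Hypothetical sketch of the false-Active problem; names are illustrative,
// not the real ResourceManager classes.
public class FalseActiveSketch {
    enum HAState { STANDBY, ACTIVE }

    HAState contextState = HAState.STANDBY;

    void transitionToActive() throws Exception {
        contextState = HAState.ACTIVE;  // state is set first...
        refreshAll();                   // ...then the refresh throws
    }

    void refreshAll() throws Exception {
        // Simulates the misconfiguration failure described above.
        throw new Exception("bad configuration");
    }

    // getServiceState and the UI read the context state, so after a failed
    // transition they still report ACTIVE.
    HAState getServiceState() {
        return contextState;
    }
}
```

Running the transition and catching the exception leaves getServiceState reporting ACTIVE, which is exactly the mismatch reported here.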
That is why I suggested updating the RMContext; in fact, changing the RMContext is necessary. Whether, and when, to also stop the active services is a separate decision.
So there are two options:
- Set the RMContext to STANDBY when the exception occurs and stop the active services. The downside is that we would have to redo the work of starting the active services if this RM later becomes active.
- Introduce a new state (say WAITING_FOR_ACTIVE), set it when the exception is thrown, and use it to decide to stop the active services when switching to standby, but not to restart them when switching back to active.
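The second option can be sketched roughly as below (again with hypothetical names, and with the service start/stop and refresh reduced to flags): WAITING_FOR_ACTIVE lets us distinguish "failed mid-transition, services still up" from a clean STANDBY, so a later switch to standby stops the services, while a retry to active does not restart them.

```java
// Hypothetical sketch of option 2 (WAITING_FOR_ACTIVE); not actual YARN code.
public class RmStateSketch {
    enum HAState { STANDBY, ACTIVE, WAITING_FOR_ACTIVE }

    HAState state = HAState.STANDBY;
    boolean activeServicesRunning = false;

    // Start services (if not already running), mark ACTIVE, then refresh;
    // on refresh failure fall back to WAITING_FOR_ACTIVE instead of ACTIVE.
    void transitionToActive(boolean refreshFails) {
        if (!activeServicesRunning) {
            activeServicesRunning = true;  // no redundant restart on retry
        }
        state = HAState.ACTIVE;            // set before refresh, as described
        if (refreshFails) {
            state = HAState.WAITING_FOR_ACTIVE;  // no longer reported Active
        }
    }

    // On a switch to standby, stop the services if they were left running
    // (covers both ACTIVE and WAITING_FOR_ACTIVE).
    void transitionToStandby() {
        if (activeServicesRunning) {
            activeServicesRunning = false;
        }
        state = HAState.STANDBY;
    }

    // What getServiceState / the UI would report.
    boolean reportsActive() {
        return state == HAState.ACTIVE;
    }
}
```

With this shape, a failed transition no longer shows the RM as Active, and a successful retry skips restarting the already-running services.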
Thoughts, Sunil G, Xuan Gong?