Thanks for detailed review !
All that is needed is store the CuratorFramework instance in RMContext.
Actually, I need to refactor out the zkClient creation logic from ZKRMStateStore as the zkClient is requiring a bunch of other configs. And because ZKRMStateStore is currently in active service, it cannot be simply moved to AlwaysOn service. So, I'd like to do it separately to minimize the core change in this jira.
The instance, rm, is not used anywhere. Why even pass it?
I was ealier directly calling rm.transitionToActive instead of calling AdminService#transitionToActive. But just to minimize the change and keep it consistent with EmbeddedElectorService, I changed to call AdminService#transitionToActive.
The only extra thing AdminService does is to refresh the ACLs. Suppose the shared storage based configraion provider is not enabled(which is the most usual case), why do we need to call refresh the configs? It cannot read the remote RM's config anyway. Without calling these refresh calls, we can avoid bugs like
YARN-3893. Also, RM itself does not need to depend on the AdminACl for it to transition to active/standby. It should always has the permission to do that. I'd like to change this part for RM to not refresh the configs if shared storage based config provider is not enabled.
why sleep for 1 second
To avoid a busy loop and rejoining immediately. That's what ActiveStandbyElector does too. It could be more than 1s. I don't think we need one more config for this.
If it is due to close(), don't we want to force give-up so the other RM becomes active? If it is on initAndStartLeaderLatch(), this RM will never become active; don't we want to just die?
What do you mean by force give-up ? exit RM ?
The underlying curator implementation will retry the connection in background, even though the exception is thrown. See Guaranteeable interface in Curator. I think exit RM is too harsh here. Even though RM remains at standby, all services should be already shutdown, so there's no harm to the end users ?
I have one question about ActiveStandbyCheckThread. if we make zkStateStore and elector to share the same zkClient, do we still need the ActiveStandbyCheckThread ? the elector itself should get notification when the connection is lost.
notLeader: Again, we should likely do more than just logging.
This is currently what EmbeddedElectorService is doing. If the leadership is already lost from zk's perspective, the other RM should take up the leadership
How about adding a method called closeLeaderLatch to complement initAndStart? That would help us avoid cases like the null check missing in rejoinElection?
I think leaderLatch could never be null ?
may be we should have a config to use embedded-elector instead of curator-elector e.g. yarn.resourcemanager.ha.use-active-standby-elector
This flag is just a temporary thing, a lot of test cases need to be changed without this flag. I plan to remove this flag and the embeddedElector code too in followup.
Why change the argument to transitionToStandby from true to false? in the following method, reinitialize(initialize) should be called outside the if. No?
Why does it need to be called outside of if (state == HAServiceProtocol.HAServiceState.ACTIVE) ? This is a fresh start, it does not need to call reinitiialize.
still feel the AdminService should be the one handling the LeaderElectorService. Also, the LeaderElectorService talks to AdminService for transitions to active/standby.
Currently, AdminService does not depend on EmbeddedLeaderElector at all. All it does is to initialize EmbeddedElectorService. May be the elector does not need to depend on AdminService too, i.e. not need to refresh the acls if shared storage based config provider is not enabled.
Will update other comments accordingly.