Thanks Jian He for your feedback.
How will the RM failover play together with YARN-3673? Let's say subCluster1 (RM1, RM2), subCluster2 (RM3, RM4). Looks like the implementation will ignore intra-cluster failover and do inter-cluster failover only?
The implementation will handle only intra-cluster failover, as the RM failover proxy in YARN-3673 will be seeded based on the subClusterId. The information in the StateStore gets updated as part of RM active-services initialization (YARN-3671). In your example, the RM failover proxy will be a connection to subCluster1, initially pointing to, say, RM1, the current primary. Suppose RM1 fails over to RM2: RM2 will then heartbeat to the StateStore against subCluster1, and we will auto-update the proxy to connect to RM2 (by querying getSubClusterInfo(subCluster1) on the Facade).
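To make that concrete, here is a minimal sketch of the re-resolution step, using the names from this thread (FederationStateStoreFacade, getSubClusterInfo, getRMAdminServiceAddress); the actual signatures and import paths in the patches may differ:

{code:java}
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterId;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterInfo;
import org.apache.hadoop.yarn.server.federation.utils.FederationStateStoreFacade;

public class ActiveRMResolver {

  private final FederationStateStoreFacade facade =
      FederationStateStoreFacade.getInstance();

  /**
   * Resolve the address of the currently active RM for a sub-cluster.
   * After RM1 fails over to RM2, RM2's heartbeat has already updated the
   * StateStore entry for subCluster1, so this lookup returns RM2's address
   * and the failover proxy can be re-seeded with it.
   */
  public String resolveActiveRM(SubClusterId subClusterId)
      throws YarnException {
    SubClusterInfo info = facade.getSubClusterInfo(subClusterId);
    return info.getRMAdminServiceAddress();
  }
}
{code}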
Inter-cluster failover is determined by the policies (YARN-5323), as they define how a queue spans multiple sub-clusters, and the Router/AMRMProxy will create an RM failover proxy per subCluster in the policy.
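A rough sketch of that wiring is below; the policy interface and the proxy factory are placeholders, since YARN-5323 is still defining the actual policy API:

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterId;

public class RouterProxyBootstrap {

  /** Placeholder for the YARN-5323 policy interface. */
  interface QueuePolicy {
    java.util.List<SubClusterId> getTargetSubClusters(String queue);
  }

  /**
   * One RM failover proxy per sub-cluster that the policy maps the queue to.
   * Each proxy handles intra-cluster failover for its own sub-cluster;
   * spanning the queue across sub-clusters is what provides inter-cluster
   * failover.
   */
  public Map<SubClusterId, ApplicationClientProtocol> bootstrapProxies(
      Configuration conf, QueuePolicy policy, String queue) {
    Map<SubClusterId, ApplicationClientProtocol> proxies = new HashMap<>();
    for (SubClusterId scId : policy.getTargetSubClusters(queue)) {
      proxies.put(scId, createRMFailoverProxy(conf, scId));
    }
    return proxies;
  }

  // Placeholder for the per-subCluster failover proxy from YARN-3673.
  private ApplicationClientProtocol createRMFailoverProxy(
      Configuration conf, SubClusterId scId) {
    throw new UnsupportedOperationException("sketch only");
  }
}
{code}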
A question for such an API: since it asks for a specific subCluster's info, do we still need the filterInactiveSubClusters flag? Even if it's required, the if/else behavior is inconsistent: the if case honors the flag, while the else case doesn't.
Good catch. I have removed filterInactiveSubClusters from getSubClusterInfo.
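The simplified lookup now looks roughly like this (a sketch only; stateStore is the membership store behind the Facade, and the record names mirror the ones used elsewhere in the patch series):

{code:java}
// Single-key lookup: there is no list to filter, so the flag was redundant.
public SubClusterInfo getSubClusterInfo(SubClusterId subClusterId)
    throws YarnException {
  GetSubClusterInfoRequest request =
      GetSubClusterInfoRequest.newInstance(subClusterId);
  return stateStore.getSubCluster(request).getSubClusterInfo();
}
{code}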
I think we should not reuse these two configs for retry; the default value of both is zero.
Valid point. I was trying to reuse existing configs, since we already have to add a few for Federation on top of the many existing ones. I looked at RMProxy and have replaced them with better-fitting ones.
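Concretely, the direction is dedicated retry settings with non-zero defaults, fed into Hadoop's standard retry machinery; the config keys and defaults below are placeholders for illustration, not the names in the patch:

{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public final class FacadeRetryPolicy {
  // Placeholder keys/defaults; the patch defines its own.
  private static final String RETRIES_KEY =
      "yarn.federation.state-store.max-retries";
  private static final String RETRY_INTERVAL_KEY =
      "yarn.federation.state-store.retry-interval-ms";

  public static RetryPolicy create(Configuration conf) {
    int maxRetries = conf.getInt(RETRIES_KEY, 3);
    long intervalMs = conf.getLong(RETRY_INTERVAL_KEY, 1000);
    // Non-zero defaults so retries actually happen out of the box.
    return RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        maxRetries, intervalMs, TimeUnit.MILLISECONDS);
  }
}
{code}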
I have updated the patch (v4) accordingly.