• Type: New Feature
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:


      This JIRA tracks the work needed to support failover from one RM instance to another, so that the RM is highly available (RM HA). The work includes leader election, transfer of control to the elected leader, and redirection of clients to the new leader.
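
The three pieces of work named in the description (leader election, transfer of control, client redirection) can be illustrated with a small in-memory sketch. This is not YARN's implementation: the class and method names (RmFailoverSketch, Rm, Elector, Client, electLeader, submit) are hypothetical, and a real elector would presumably rely on an external coordination service rather than a local list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical in-memory sketch of RM failover; not YARN's actual code. */
final class RmFailoverSketch {

    enum HaState { STANDBY, ACTIVE }

    /** Stand-in for one ResourceManager instance (assumed shape). */
    static final class Rm {
        final String id;
        HaState state = HaState.STANDBY;
        boolean alive = true;

        Rm(String id) { this.id = id; }
    }

    /** Stand-in for a leader elector; a real one would use a coordination service. */
    static final class Elector {
        private final List<Rm> candidates = new ArrayList<>();

        void register(Rm rm) { candidates.add(rm); }

        /**
         * Leader election plus transfer of control: the first live candidate
         * becomes ACTIVE, every other candidate is set to STANDBY.
         */
        Optional<Rm> electLeader() {
            Rm leader = null;
            for (Rm rm : candidates) {
                if (rm.alive && leader == null) {
                    leader = rm;
                    rm.state = HaState.ACTIVE;
                } else {
                    rm.state = HaState.STANDBY;
                }
            }
            return Optional.ofNullable(leader);
        }
    }

    /** Client redirection: try each known RM until the active one is found. */
    static final class Client {
        private final List<Rm> known;

        Client(List<Rm> known) { this.known = known; }

        /** Returns the id of the RM that accepted the request. */
        String submit(String request) {
            for (Rm rm : known) {
                if (rm.alive && rm.state == HaState.ACTIVE) {
                    return rm.id;
                }
            }
            throw new IllegalStateException("no active RM for " + request);
        }
    }

    /** Runs one failover: rm1 serves first, then rm2 takes over after rm1 fails. */
    static String demo() {
        Rm rm1 = new Rm("rm1");
        Rm rm2 = new Rm("rm2");
        Elector elector = new Elector();
        elector.register(rm1);
        elector.register(rm2);
        Client client = new Client(Arrays.asList(rm1, rm2));

        elector.electLeader();
        String before = client.submit("app-1");

        rm1.alive = false;      // simulate failure of the active RM
        elector.electLeader();  // the standby is elected and takes control
        String after = client.submit("app-2");

        return before + "," + after;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the client simply scans a fixed list of RMs. Whether clients should discover the new leader by retrying a configured list or by consulting the election service is a design question the description leaves open.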

      1. rm-ha-phase1-approach-draft1.pdf
        165 kB
        Karthik Kambatla
      2. rm-ha-phase1-draft2.pdf
        170 kB
        Karthik Kambatla
      3. YARN ResourceManager Automatic Failover-rev-07-21-13.pdf
        207 kB
        Bikas Saha
      4. YARN ResourceManager Automatic Failover-rev-08-04-13.pdf
        207 kB
        Bikas Saha

        Issue Links

        1. Add shutdown support to non-service RM components Sub-task Open Xuan Gong  
        2. Support automatic failover using ZKFC Sub-task Open Karthik Kambatla  
        3. Add end-to-end tests for HA Sub-task Open Xuan Gong  
        4. Verify RM HA works in secure clusters Sub-task Open Wing Yew Poon  
        5. RM should log using RMStore at startup time Sub-task Patch Available Tsuyoshi OZAWA
           Original Estimate - 3h
           Remaining Estimate - 3h
        6. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Sub-task Open Tsuyoshi OZAWA  
        7. Add an option to yarn rmadmin to clear the znode used by embedded elector Sub-task Open Karthik Kambatla  
        8. Add an easy way to turn on HA Sub-task Open Karthik Kambatla  
        9. Support explicit failover when automatic failover is enabled Sub-task Open Karthik Kambatla  
        10. Make admin refresh of Fair scheduler configuration work across RM failover Sub-task Open Xuan Gong  
        11. Cleanup YARN HAUtil class Sub-task Open Vinod Kumar Vavilapalli  
        12. Document RM HA Sub-task Open Karthik Kambatla  
        13. Handle AMRMTokens across RM failover Sub-task Open Unassigned  
        14. Add retry cache support in ResourceManager Sub-task Open Tsuyoshi OZAWA  
        15. Persist ClusterMetrics across RM HA transitions Sub-task Open Unassigned  
        16. cancelDelegationToken should survive RM failover Sub-task Open Zhijie Shen  
        17. Both RM stuck in standby mode when automatic failover is enabled Sub-task Open Vinod Kumar Vavilapalli  
        18. ZK store: Add yarn.resourcemanager.zk-state-store.root-node.auth for root node auth Sub-task Open Karthik Kambatla  
        19. Implement and verify Scheduler#moveApplication() idempotent for CapacityScheduler/FairScheduler Sub-task Open Xuan Gong  
        20. Yarn standby RM taking long to transition to active Sub-task Patch Available Xuan Gong  



            • Assignee:
              Harsh J
            • Votes:
              2
            • Watchers:
              71


              • Created:

                Time Tracking

                Original Estimate - 51h
                Remaining Estimate - 51h
                Time Spent - Not Specified