Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None

      Description

      This JIRA tracks the work needed to support failover from one ResourceManager (RM) instance to another, so that the RM can be made highly available (RM HA). The work includes leader election, transfer of control to the elected leader, and client redirection to the new leader.
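
      The three pieces of work named in the description (leader election, transfer of control, client redirection) can be pictured with a toy model. This is only an illustrative sketch under simplifying assumptions, not the design in the attached documents: `ResourceManager`, `StandbyError`, `fail_over`, and `submit_with_failover` are hypothetical names, and the "election" here just promotes the first surviving instance.

```python
from dataclasses import dataclass
from typing import List


class StandbyError(Exception):
    """Raised by a (hypothetical) RM that is not the active leader."""


@dataclass
class ResourceManager:
    """Toy stand-in for an RM instance; not a real YARN class."""
    rm_id: str
    active: bool = False

    def submit(self, request: str) -> str:
        # A standby RM refuses work so the client knows to look elsewhere.
        if not self.active:
            raise StandbyError(f"{self.rm_id} is in standby")
        return f"{self.rm_id} accepted {request}"


def fail_over(rms: List[ResourceManager], failed_id: str) -> ResourceManager:
    """Demote the failed RM, then promote the first other RM (toy election)."""
    for rm in rms:
        if rm.rm_id == failed_id:
            rm.active = False
    for rm in rms:
        if rm.rm_id != failed_id:
            rm.active = True  # transfer of control to the new leader
            return rm
    raise RuntimeError("no standby RM available")


def submit_with_failover(rms: List[ResourceManager], request: str) -> str:
    """Client redirection: try each known RM until the active one answers."""
    for rm in rms:
        try:
            return rm.submit(request)
        except StandbyError:
            continue
    raise RuntimeError("no active RM found")
```

      In this model a client holding the list `[rm1, rm2]` keeps working after `fail_over(rms, "rm1")`, because its next request is refused by `rm1` and then accepted by `rm2`. A real implementation would need a coordination service for the election and fencing of the old leader, which this sketch leaves out.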

      1. rm-ha-phase1-approach-draft1.pdf (165 kB, Karthik Kambatla)
      2. rm-ha-phase1-draft2.pdf (170 kB, Karthik Kambatla)
      3. YARN ResourceManager Automatic Failover-rev-07-21-13.pdf (207 kB, Bikas Saha)
      4. YARN ResourceManager Automatic Failover-rev-08-04-13.pdf (207 kB, Bikas Saha)

        Issue Links

        1. Add shutdown support to non-service RM components (Sub-task, Open, Xuan Gong)
        2. Support automatic failover using ZKFC (Sub-task, Open, Karthik Kambatla)
        3. Add end-to-end tests for HA (Sub-task, Open, Xuan Gong)
        4. Verify RM HA works in secure clusters (Sub-task, Open, Wing Yew Poon)
        5. RM should log using RMStore at startup time (Sub-task, Patch Available, Tsuyoshi OZAWA; 0% complete, original estimate 3h, remaining estimate 3h)
        6. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA (Sub-task, Open, Tsuyoshi OZAWA)
        7. Add an option to yarn rmadmin to clear the znode used by embedded elector (Sub-task, Open, Karthik Kambatla)
        8. Add an easy way to turn on HA (Sub-task, Open, Karthik Kambatla)
        9. Support explicit failover when automatic failover is enabled (Sub-task, Open, Karthik Kambatla)
        10. Make admin refresh of Fair scheduler configuration work across RM failover (Sub-task, Open, Xuan Gong)
        11. Cleanup YARN HAUtil class (Sub-task, Open, Vinod Kumar Vavilapalli)
        12. Document RM HA (Sub-task, Open, Karthik Kambatla)
        13. Handle AMRMTokens across RM failover (Sub-task, Open, Unassigned)
        14. Add retry cache support in ResourceManager (Sub-task, Open, Tsuyoshi OZAWA)
        15. Persist ClusterMetrics across RM HA transitions (Sub-task, Open, Unassigned)
        16. cancelDelegationToken should survive RM failover (Sub-task, Open, Zhijie Shen)
        17. Both RM stuck in standby mode when automatic failover is enabled (Sub-task, Open, Vinod Kumar Vavilapalli)
        18. ZK store: Add yarn.resourcemanager.zk-state-store.root-node.auth for root node auth (Sub-task, Open, Karthik Kambatla)
        19. Implement and verify Scheduler#moveApplication() idempotent for CapacityScheduler/FairScheduler (Sub-task, Open, Xuan Gong)
        20. Yarn standby RM taking long to transition to active (Sub-task, Patch Available, Xuan Gong)
         


            People

            • Assignee: Unassigned
            • Reporter: Harsh J
            • Votes: 2
            • Watchers: 71

              Dates

              • Created:
                Updated:

                Time Tracking

                Estimated: 51h
                Remaining: 51h
                Logged: Not Specified
