Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:
      None

      Description

      This JIRA tracks the work needed to support failover from one RM instance to another, so that the RM is highly available (RM HA). The work includes leader election, transfer of control to the leader, and client redirection to the new leader.

      1. rm-ha-phase1-approach-draft1.pdf
        165 kB
        Karthik Kambatla
      2. rm-ha-phase1-draft2.pdf
        170 kB
        Karthik Kambatla
      3. YARN ResourceManager Automatic Failover-rev-07-21-13.pdf
        207 kB
        Bikas Saha
      4. YARN ResourceManager Automatic Failover-rev-08-04-13.pdf
        207 kB
        Bikas Saha

        Issue Links

        1. Add shutdown support to non-service RM components | Sub-task | Open | Xuan Gong
        2. Support automatic failover using ZKFC | Sub-task | Open | Karthik Kambatla
        3. Add end-to-end tests for HA | Sub-task | Open | Xuan Gong
        4. Verify RM HA works in secure clusters | Sub-task | Open | Wing Yew Poon
        5. RM should log using RMStore at startup time | Sub-task | Patch Available | Tsuyoshi OZAWA | 0% done, original estimate 3h, remaining estimate 3h
        6. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA | Sub-task | Open | Tsuyoshi OZAWA
        7. Add an option to yarn rmadmin to clear the znode used by embedded elector | Sub-task | Open | Karthik Kambatla
        8. Add an easy way to turn on HA | Sub-task | Open | Karthik Kambatla
        9. Support explicit failover when automatic failover is enabled | Sub-task | Open | Karthik Kambatla
        10. Make admin refresh of Fair scheduler configuration work across RM failover | Sub-task | Open | Xuan Gong
        11. Cleanup YARN HAUtil class | Sub-task | Open | Vinod Kumar Vavilapalli
        12. Document RM HA | Sub-task | Open | Karthik Kambatla
        13. Handle AMRMTokens across RM failover | Sub-task | Open | Unassigned
        14. Add retry cache support in ResourceManager | Sub-task | Open | Tsuyoshi OZAWA
        15. Persist ClusterMetrics across RM HA transitions | Sub-task | Open | Unassigned
        16. cancelDelegationToken should survive RM failover | Sub-task | Open | Zhijie Shen
        17. Both RM stuck in standby mode when automatic failover is enabled | Sub-task | Open | Vinod Kumar Vavilapalli
        18. ZK store: Add yarn.resourcemanager.zk-state-store.root-node.auth for root node auth | Sub-task | Open | Karthik Kambatla
        19. Implement and verify Scheduler#moveApplication() idempotent for CapacityScheduler/FairScheduler | Sub-task | Open | Xuan Gong
        20. Yarn standby RM taking long to transition to active | Sub-task | Patch Available | Xuan Gong

          Activity

          Vinod Kumar Vavilapalli made changes -
          Assignee Bikas Saha [ bikassaha ]
          Component/s resourcemanager [ 12319322 ]
          Karthik Kambatla made changes -
          Link This issue is duplicated by YARN-1585 [ YARN-1585 ]
          Tsuyoshi OZAWA made changes -
          Link This issue relates to YARN-1543 [ YARN-1543 ]
          Karthik Kambatla made changes -
          Link This issue relates to YARN-1460 [ YARN-1460 ]
          Steve Loughran made changes -
          Link This issue is related to HADOOP-9905 [ HADOOP-9905 ]
          Bikas Saha made changes -
          Link This issue is blocked by YARN-1318 [ YARN-1318 ]
          Tsuyoshi OZAWA made changes -
          Link This issue relates to YARN-1305 [ YARN-1305 ]
          Junping Du made changes -
          Assignee Bikas Saha [ bikassaha ]
          shenhong made changes -
          Assignee shenhong [ shenhong ]
          shenhong made changes -
          Assignee Bikas Saha [ bikassaha ] -> shenhong [ shenhong ]
          Bikas Saha made changes -
          Link This issue is related to YARN-556 [ YARN-556 ]
          Karthik Kambatla made changes -
          Link This issue relates to YARN-1139 [ YARN-1139 ]
          Karthik Kambatla made changes -
          Attachment rm-ha-phase1-draft2.pdf [ 12591692 ]
          Karthik Kambatla made changes -
          Attachment rm-ha-phase1-approach-draft1.pdf [ 12591148 ]
          Karthik Kambatla made changes -
          Attachment rm-ha-phase1-approach-draft1.pdf [ 12591147 ]
          Bikas Saha made changes -
          Description
          Original Value:
          One of the goals presented on MAPREDUCE-279 was to have high availability. One way that was discussed, per Mahadev/others on https://issues.apache.org/jira/browse/MAPREDUCE-2648 and other places, was ZK:

          {quote}
          Am not sure, if you already know about the MR-279 branch (the next version of MR framework). We've been trying to integrate ZK into the framework from the beginning. As for now, we are just doing restart with ZK but soon we should have a HA soln with ZK.
          {quote}

          There is now MAPREDUCE-4343 that tracks recoverability via ZK. This JIRA is meant to track HA via ZK.

          Currently there isn't a HA solution for RM, via ZK or otherwise.
          New Value:
          This jira tracks work needed to be done to support one RM instance failing over to another RM instance so that we can have RM HA. Work includes leader election, transfer of control to leader and client re-direction to new leader.
          Philip Zeyliger made changes -
          Description One of the goals presented on MAPREDUCE-279 was to have high availability. One way that was discussed, per Mahadev/others on https://issues.apache.org/jira/browse/MAPREDUCE-2648 and other places, was ZK:

          {quote}
          Am not sure, if you already know about the MR-279 branch (the next version of MR framework). We've been trying to integrate ZK into the framework from the beginning. As for now, we are just doing restart with ZK but soon we should have a HA soln with ZK.
          {quote}

          There is now MAPREDUCE-4343 that tracks recoverability via ZK. This JIRA is meant to track HA via ZK.

          Currently there isn't a HA solution for RM, via ZK or otherwise.
          Bikas Saha made changes -
          Summary ZK-based High Availability (HA) for ResourceManager (RM) -> ResourceManager (RM) High-Availability (HA)
          Bikas Saha made changes -
          Assignee Bikas Saha [ bikassaha ]
          Eli Collins made changes -
          Resolution Duplicate [ 3 ]
          Status Resolved [ 5 ] -> Reopened [ 4 ]
          Assignee Bikas Saha [ bikassaha ]
          Eli Collins made changes -
          Status Reopened [ 4 ] -> Resolved [ 5 ]
          Resolution Duplicate [ 3 ]
          Harsh J made changes -
          Project Hadoop Map/Reduce [ 12310941 ] -> Hadoop YARN [ 12313722 ]
          Key MAPREDUCE-4345 -> YARN-149
          Issue Type Improvement [ 4 ] -> New Feature [ 2 ]
          Harsh J made changes -
          Resolution Duplicate [ 3 ]
          Status Resolved [ 5 ] -> Reopened [ 4 ]
          Arun C Murthy made changes -
          Status Open [ 1 ] -> Resolved [ 5 ]
          Resolution Duplicate [ 3 ]
          Bikas Saha made changes -
          Assignee Bikas Saha [ bikassaha ]
          Bikas Saha made changes -
          Link This issue is related to MAPREDUCE-4326 [ MAPREDUCE-4326 ]
          Harsh J made changes -
          Link This issue is related to MAPREDUCE-2288 [ MAPREDUCE-2288 ]
          Harsh J made changes -
          Field Original Value New Value
          Link This issue is part of MAPREDUCE-279 [ MAPREDUCE-279 ]
          Harsh J created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Harsh J
            • Votes:
              2
              Watchers:
              71

              Dates

              • Created:
                Updated:

                Time Tracking

                Estimated: 51h
                Remaining: 51h
                Logged: Not Specified
