Hadoop YARN
YARN-556

[Umbrella] RM Restart phase 2 - Work preserving restart

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:
      None
    • Target Version/s:

      Description

      YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts.

      1. YARN-1372.prelim.patch
        53 kB
        Anubhav Dhoot
      2. Work Preserving RM Restart.pdf
        202 kB
        Bikas Saha
      3. WorkPreservingRestartPrototype.001.patch
        88 kB
        Anubhav Dhoot

        Issue Links

        1. ApplicationMasterService to allow Register of an app that was running before restart (Sub-task; Closed; Anubhav Dhoot)
        2. AM should implement Resync with the ApplicationMasterService instead of shutting down (Sub-task; Closed; Rohith Sharma K S)
        3. After restart NM should resync with the RM without killing containers (Sub-task; Closed; Anubhav Dhoot)
        4. Common work to re-populate containers’ state into scheduler (Sub-task; Closed; Jian He)
        5. Capacity scheduler to re-populate container allocation state (Sub-task; Resolved; Jian He)
        6. Fair scheduler to re-populate container allocation state (Sub-task; Closed; Anubhav Dhoot)
        7. FIFO scheduler to re-populate container allocation state (Sub-task; Resolved; Jian He)
        8. Ensure all completed containers are reported to the AMs across RM restart (Sub-task; Closed; Anubhav Dhoot)
        9. Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps (Sub-task; Resolved; Omkar Vinit Joshi)
        10. Revisit AM link being broken for work preserving restart (Sub-task; Resolved; Unassigned)
        11. Recover Unmanaged AMs (Sub-task; Resolved; Anubhav Dhoot)
        12. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over (Sub-task; Closed; Tsuyoshi Ozawa)
        13. Fix ordering of starting services inside the RM (Sub-task; Resolved; Jian He)
        14. Threshold for RM to accept requests from AM after failover (Sub-task; Closed; Jian He)
        15. Merge some of the common lib code in schedulers (Sub-task; Closed; Jian He)
        16. ContainerId creation after work preserving restart is broken (Sub-task; Closed; Tsuyoshi Ozawa)
        17. Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus (Sub-task; Closed; Jian He)
        18. Recover missing container information (Sub-task; Closed; Jian He)
        19. Ensure distributed shell work with RM work-preserving recovery (Sub-task; Closed; Jian He)
        20. Update ContainerId#toString() to avoid conflicts before and after RM restart (Sub-task; Closed; Tsuyoshi Ozawa)
        21. ContainerId can overflow with RM restart (Sub-task; Closed; Tsuyoshi Ozawa)
        22. AM release request may be lost on RM restart (Sub-task; Closed; Jian He)
        23. Add containers to launchedContainers list in RMNode on container recovery (Sub-task; Closed; Jian He)
        24. Marking ContainerId#getId as deprecated (Sub-task; Closed; Tsuyoshi Ozawa)
        25. RM should not recover containers from previously failed attempt when AM restart is not enabled (Sub-task; Closed; Jian He)
        26. Possible livelock in CapacityScheduler when RM is recovering apps (Sub-task; Closed; Jian He)
        27. Update ConverterUtils#toContainerId to parse epoch (Sub-task; Closed; Tsuyoshi Ozawa)
        28. Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId (Sub-task; Closed; Tsuyoshi Ozawa)
        29. Add a percentage-node threshold for RM to wait for new allocations after restart/failover (Sub-task; Open; Vinod Kumar Vavilapalli)
        30. Distributed shell AM may re-launch containers if RM work preserving restart happens (Sub-task; Patch Available; Chun Chen)
        31. TestWorkPreservingRMRestart: Augment FS tests with queue and headroom checks (Sub-task; Closed; Tsuyoshi Ozawa)
        32. NPE when RM tries to transfer state from previous attempt on recovery (Sub-task; Resolved; Jian He)
        33. Document work-preserving RM restart (Sub-task; Closed; Jian He)
        34. Make work-preserving-recovery the default mechanism for RM recovery (Sub-task; Closed; Jian He)
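        Several of the sub-tasks listed here (NM resync without killing containers, re-populating container state into the scheduler) point at one general flow: after a restart the RM has lost its in-memory allocation state, and it rebuilds that state from what the nodes report as still running. The toy sketch below illustrates only that idea. Every class, field, and method name in it (ToyResourceManager, ContainerReport, registerNode, and so on) is hypothetical and invented for illustration; it is not the YARN API and not the patch attached to this issue.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WorkPreservingRestartSketch {

    /** Hypothetical per-container report a node sends when it re-registers. */
    static final class ContainerReport {
        final String containerId;
        final String appId;
        final int memoryMb;

        ContainerReport(String containerId, String appId, int memoryMb) {
            this.containerId = containerId;
            this.appId = appId;
            this.memoryMb = memoryMb;
        }
    }

    /** Hypothetical RM whose scheduler state lives only in memory. */
    static final class ToyResourceManager {
        private final Map<String, Integer> usedMemoryByApp = new HashMap<>();
        private final Map<String, String> containerToNode = new HashMap<>();

        /** Rebuild allocation state from the containers a node reports as running. */
        void registerNode(String nodeId, List<ContainerReport> running) {
            for (ContainerReport report : running) {
                // Count each container once so a repeated resync does not double-charge the app.
                if (containerToNode.putIfAbsent(report.containerId, nodeId) == null) {
                    usedMemoryByApp.merge(report.appId, report.memoryMb, Integer::sum);
                }
            }
        }

        int usedMemory(String appId) {
            return usedMemoryByApp.getOrDefault(appId, 0);
        }

        int recoveredContainers() {
            return containerToNode.size();
        }
    }

    public static void main(String[] args) {
        // A freshly restarted RM starts with empty scheduler state.
        ToyResourceManager rm = new ToyResourceManager();

        // Nodes re-register and report containers that kept running during the restart.
        rm.registerNode("node-1", Arrays.asList(
                new ContainerReport("c1", "app_1", 1024),
                new ContainerReport("c2", "app_2", 512)));
        rm.registerNode("node-2", Arrays.asList(
                new ContainerReport("c3", "app_1", 2048)));

        System.out.println("recovered containers: " + rm.recoveredContainers());
        System.out.println("app_1 memory in use (MB): " + rm.usedMemory("app_1"));
    }
}
```

        The sketch keys recovery on container IDs, so a node that resyncs twice is harmless. A real implementation would also have to handle queues, headroom, completed containers, and ID uniqueness across restarts, which is what many of the sub-tasks track.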

          Activity

          Vinod Kumar Vavilapalli made changes -
          Link This issue relates to MAPREDUCE-5567 [ MAPREDUCE-5567 ]
          Vinod Kumar Vavilapalli made changes -
          Status Open [ 1 ] -> Resolved [ 5 ]
          Assignee Bikas Saha [ bikassaha ]
          Resolution Fixed [ 1 ]
          Vinod Kumar Vavilapalli made changes -
          Summary RM Restart phase 2 - Work preserving restart -> [Umbrella] RM Restart phase 2 - Work preserving restart
          Anubhav Dhoot made changes -
          Attachment YARN-1372.prelim.patch [ 12659218 ]
          Karthik Kambatla (Inactive) made changes -
          Target Version/s 2.5.0 [ 12326262 ] -> 2.6.0 [ 12327197 ]
          Junping Du made changes -
          Link This issue is depended upon by YARN-666 [ YARN-666 ]
          Anubhav Dhoot made changes -
          Karthik Kambatla (Inactive) made changes -
          Target Version/s 2.5.0 [ 12326262 ]
          Bikas Saha made changes -
          Link This issue relates to YARN-149 [ YARN-149 ]
          Bikas Saha made changes -
          Attachment Work Preserving RM Restart.pdf [ 12599562 ]
          Bikas Saha made changes -
          Labels gsoc2013
          Bikas Saha made changes -
          Description The basic idea is already documented on YARN-128. This will describe further details. -> YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts.
          Bikas Saha made changes -
          Link This issue is related to YARN-128 [ YARN-128 ]
          Bikas Saha made changes -
          Parent YARN-128 [ 12559788 ]
          Issue Type Sub-task [ 7 ] -> New Feature [ 2 ]
          Bikas Saha made changes -
          Summary RM Restart phase 2 - Design for work preserving restart -> RM Restart phase 2 - Work preserving restart
          Bikas Saha made changes -
          Labels gsoc2013
          Bikas Saha made changes -
          Field Original Value New Value
          Component/s resourcemanager [ 12319322 ]
          Bikas Saha created issue -

            People

            • Assignee: Unassigned
            • Reporter: Bikas Saha
            • Votes: 0
            • Watchers: 44
