Hadoop YARN / YARN-556

[Umbrella] RM Restart phase 2 - Work preserving restart

Details

    Description

      YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts.

      Attachments

        1. WorkPreservingRestartPrototype.001.patch (88 kB, Anubhav Dhoot)
        2. Work Preserving RM Restart.pdf (202 kB, Bikas Saha)
        3. YARN-1372.prelim.patch (53 kB, Anubhav Dhoot)

        Issue Links

          1. ApplicationMasterService to allow Register of an app that was running before restart (Sub-task, Closed, Anubhav Dhoot)
          2. AM should implement Resync with the ApplicationMasterService instead of shutting down (Sub-task, Closed, Rohith Sharma K S)
          3. After restart NM should resync with the RM without killing containers (Sub-task, Closed, Anubhav Dhoot)
          4. Common work to re-populate containers’ state into scheduler (Sub-task, Closed, Jian He)
          5. Capacity scheduler to re-populate container allocation state (Sub-task, Resolved, Jian He)
          6. Fair scheduler to re-populate container allocation state (Sub-task, Closed, Anubhav Dhoot)
          7. FIFO scheduler to re-populate container allocation state (Sub-task, Resolved, Jian He)
          8. Ensure all completed containers are reported to the AMs across RM restart (Sub-task, Closed, Anubhav Dhoot)
          9. Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps (Sub-task, Resolved, Omkar Vinit Joshi)
          10. Revisit AM link being broken for work preserving restart (Sub-task, Resolved, Unassigned)
          11. Recover Unmanaged AMs (Sub-task, Resolved, Anubhav Dhoot)
          12. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over (Sub-task, Closed, Tsuyoshi Ozawa)
          13. Fix ordering of starting services inside the RM (Sub-task, Resolved, Jian He)
          14. Threshold for RM to accept requests from AM after failover (Sub-task, Closed, Jian He)
          15. Merge some of the common lib code in schedulers (Sub-task, Closed, Jian He)
          16. ContainerId creation after work preserving restart is broken (Sub-task, Closed, Tsuyoshi Ozawa)
          17. Replace RegisterNodeManagerRequest#ContainerStatus with a new NMContainerStatus (Sub-task, Closed, Jian He)
          18. Recover missing container information (Sub-task, Closed, Jian He)
          19. Ensure distributed shell work with RM work-preserving recovery (Sub-task, Closed, Jian He)
          20. Update ContainerId#toString() to avoid conflicts before and after RM restart (Sub-task, Closed, Tsuyoshi Ozawa)
          21. ContainerId can overflow with RM restart (Sub-task, Closed, Tsuyoshi Ozawa)
          22. AM release request may be lost on RM restart (Sub-task, Closed, Jian He)
          23. Add containers to launchedContainers list in RMNode on container recovery (Sub-task, Closed, Jian He)
          24. Marking ContainerId#getId as deprecated (Sub-task, Closed, Tsuyoshi Ozawa)
          25. RM should not recover containers from previously failed attempt when AM restart is not enabled (Sub-task, Closed, Jian He)
          26. Possible livelock in CapacityScheduler when RM is recovering apps (Sub-task, Closed, Jian He)
          27. Update ConverterUtils#toContainerId to parse epoch (Sub-task, Closed, Tsuyoshi Ozawa)
          28. Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId (Sub-task, Closed, Tsuyoshi Ozawa)
          29. Add a percentage-node threshold for RM to wait for new allocations after restart/failover (Sub-task, Open, Vinod Kumar Vavilapalli)
          30. Distributed shell AM may re-launch containers if RM work preserving restart happens (Sub-task, Resolved, Shane Kumpf)
          31. TestWorkPreservingRMRestart: Augment FS tests with queue and headroom checks (Sub-task, Closed, Tsuyoshi Ozawa)
          32. NPE when RM tries to transfer state from previous attempt on recovery (Sub-task, Resolved, Jian He)
          33. Document work-preserving RM restart (Sub-task, Closed, Jian He)
          34. Make work-preserving-recovery the default mechanism for RM recovery (Sub-task, Closed, Jian He)

            People

              Assignee: Unassigned
              Reporter: Bikas Saha (bikassaha)
              Votes: 0
              Watchers: 50
