Hadoop YARN / YARN-128

[Umbrella] RM Restart Phase 1: State storage and non-work-preserving recovery

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:
      None

      Description

      This umbrella JIRA tracks the work needed to preserve critical state information and reload it upon RM restart.

        Attachments

        1. MR-4343.1.patch
          17 kB
          Tsuyoshi Ozawa
        2. restart-12-11-zkstore.patch
          21 kB
          Bikas Saha
        3. restart-fs-store-11-17.patch
          17 kB
          Bikas Saha
        4. restart-zk-store-11-17.patch
          61 kB
          Bikas Saha
        5. RM-recovery-initial-thoughts.txt
          3 kB
          Bikas Saha
        6. RMRestartPhase1.pdf
          59 kB
          Bikas Saha
        7. YARN-128.full-code.3.patch
          176 kB
          Bikas Saha
        8. YARN-128.full-code.5.patch
          255 kB
          Bikas Saha
        9. YARN-128.full-code-4.patch
          179 kB
          Bikas Saha
        10. YARN-128.new-code-added.3.patch
          74 kB
          Bikas Saha
        11. YARN-128.new-code-added-4.patch
          78 kB
          Bikas Saha
        12. YARN-128.old-code-removed.3.patch
          123 kB
          Bikas Saha
        13. YARN-128.old-code-removed.4.patch
          123 kB
          Bikas Saha
        14. YARN-128.patch
          92 kB
          Devaraj K

          Issue Links

          1. Remove old code for restart (Sub-task, Closed, Bikas Saha)
          2. Make changes for RM restart phase 1 (Sub-task, Closed, Bikas Saha)
          3. Add FS-based persistent store implementation for RMStateStore (Sub-task, Closed, Bikas Saha)
          4. Add FileSystem based store for RM (Sub-task, Resolved, Bikas Saha)
          5. Security related work for RM restart (Sub-task, Resolved, Bikas Saha)
          6. Add Zookeeper-based store implementation for RMStateStore (Sub-task, Closed, Karthik Kambatla)
          7. Add HDFS based store for RM which manages the store using directories (Sub-task, Resolved, Jian He)
          8. Create common proxy client for communicating with RM (Sub-task, Closed, Jian He)
          9. Delayed store operations should not result in RM unavailability for app submission (Sub-task, Closed, Zhijie Shen)
          10. AM max attempts is not checked when RM restart and try to recover attempts (Sub-task, Closed, Jian He)
          11. Race condition causing RM to potentially relaunch already unregistered AMs on RM restart (Sub-task, Closed, Jian He)
          12. NM should reject containers allocated by previous RM (Sub-task, Closed, Jian He)
          13. Test and verify that app delegation tokens are added to tokenRenewer after RM restart (Sub-task, Closed, Jian He)
          14. Restore appToken and clientToken for app attempt after RM restart (Sub-task, Closed, Jian He)
          15. Restore clientToken for app attempt after RM restart (Sub-task, Resolved, Jian He)
          16. Restore RMDelegationTokens after RM Restart (Sub-task, Closed, Jian He)
          17. RMStateStore's removeApplication APIs should just take an applicationId (Sub-task, Resolved, Tsuyoshi Ozawa)
          18. Slow or failing DelegationToken renewals on submission itself make RM unavailable (Sub-task, Closed, Omkar Vinit Joshi)
          19. verify that new jobs submitted with old RM delegation tokens after RM restart are accepted (Sub-task, Closed, Jian He)
          20. Store completed application information in RM state store (Sub-task, Closed, Jian He)
          21. RM crashes if it restarts while the state-store is down (Sub-task, Closed, Jian He)
          22. Apps Completed metrics on web UI is not correct after RM restart (Sub-task, Resolved, Jian He)
          23. List of applications at NM web UI is inconsistent with applications at RM UI after RM restart (Sub-task, Resolved, Jian He)
          24. Change FileSystemRMStateStore to use directories (Sub-task, Closed, Jian He)
          25. Document RM Restart feature (Sub-task, Closed, Jian He)
          26. "Active users" field in Resourcemanager scheduler UI gives negative values (Sub-task, Resolved, Unassigned)
          27. Handle app recovery differently for AM failures and RM restart (Sub-task, Resolved, Unassigned)
          28. Recovery issues on RM Restart with FileSystemRMStateStore (Sub-task, Resolved, Karthik Kambatla)
          29. Populate AMRMTokens back to AMRMTokenSecretManager after RM restarts (Sub-task, Closed, Jian He)
          30. RMStateStore should flush all pending store events before closing (Sub-task, Closed, Jian He)
          31. RM may relaunch already KILLED / FAILED jobs after RM restarts (Sub-task, Resolved, Jian He)
          32. AM fails to register if RM restarts within 5s of job submission (Sub-task, Resolved, Unassigned)
          33. During RM restart, RM should start a new attempt only when previous attempt exits for real (Sub-task, Closed, Omkar Vinit Joshi)
          34. Register ClientToken MasterKey in SecretManager after it is saved (Sub-task, Closed, Jian He)
          35. Save version information in the state store (Sub-task, Closed, Jian He)
          36. FileSystemRMStateStore can leave partial files that prevent subsequent recovery (Sub-task, Closed, Omkar Vinit Joshi)
          37. Rethink znode structure for RM HA (Sub-task, Closed, Tsuyoshi Ozawa)
          38. Batching optimization for ZKRMStateStore (Sub-task, Resolved, Tsuyoshi Ozawa)
          39. Implement a RMStateStore cleaner for deleting application/attempt info (Sub-task, Closed, Jian He)
          40. RM hangs on shutdown if calling system.exit in serviceInit or serviceStart (Sub-task, Closed, Jian He)
          41. Check time cost for recovering max-app-limit applications (Sub-task, Resolved, Jian He)
          42. Change killing application to wait until state store is done (Sub-task, Closed, Jian He)
          43. Apps should be saved after it's accepted by the scheduler (Sub-task, Open, Jian He)
          44. Fix invalid RMApp transition from NEW to FINAL_SAVING (Sub-task, Closed, Karthik Kambatla)
          45. Revisit RMApp transitions from NEW on RECOVER (Sub-task, Resolved, Unassigned)
          46. Excessive logging for app and attempts on RM recovery (Sub-task, Open, Unassigned)
          47. Work preserving recovery of Unmanaged AMs (Sub-task, Resolved, Subru Krishnan)
          48. NPE on registerNodeManager if the request has containers for UnmanagedAMs (Sub-task, Closed, Karthik Kambatla)
          49. Job stays in PREP state for long time after RM Restarts (Sub-task, Closed, Jian He)
          50. Succeeded application remains in accepted after RM restart (Sub-task, Closed, Jian He)
          51. Better reporting of finished containers to AMs (Sub-task, Resolved, Unassigned)
          52. RM should honor NM heartbeat expiry after RM restart (Sub-task, Open, Unassigned)
          53. Move RM recovery related proto to yarn_server_resourcemanager_recovery.proto (Sub-task, Closed, Tsuyoshi Ozawa)
          54. Remove ApplicationAttemptState and ApplicationState class in RMStateStore class (Sub-task, Closed, Tsuyoshi Ozawa)
          55. Add leveldb-based implementation for RMStateStore (Sub-task, Closed, Jason Lowe)
          56. RMProxy should retry EOFException (Sub-task, Closed, Jian He)

              People

              • Assignee:
                Unassigned
                Reporter:
                acmurthy Arun C Murthy
              • Votes:
                1
                Watchers:
                70
