Hadoop YARN / YARN-128

[Umbrella] RM Restart Phase 1: State storage and non-work-preserving recovery

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-alpha
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None

    Description

      This umbrella JIRA tracks the work needed to preserve critical ResourceManager (RM) state information and reload it upon RM restart.
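
      The persist-and-reload idea in the description can be illustrated with a minimal Python sketch. It is not the RM's actual implementation: the class name `FileStateStore`, its methods, the one-JSON-file-per-application layout, and the sample application ID are all hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile


class FileStateStore:
    """Hypothetical store (not a Hadoop API): one JSON file per application."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def store_application(self, app_id, state):
        # Write to a temporary file first, then rename it into place, so a
        # crash mid-write never leaves a partial record under the final name.
        fd, tmp_path = tempfile.mkstemp(dir=self.root, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, os.path.join(self.root, app_id + ".json"))

    def remove_application(self, app_id):
        os.remove(os.path.join(self.root, app_id + ".json"))

    def load_state(self):
        # Called once after a restart: rebuild the in-memory view from disk.
        # Leftover ".tmp" files from an interrupted write are ignored.
        apps = {}
        for name in sorted(os.listdir(self.root)):
            if name.endswith(".json"):
                with open(os.path.join(self.root, name)) as f:
                    apps[name[:-len(".json")]] = json.load(f)
        return apps


# Simulate a restart: a fresh store object reads what the old one persisted.
root = tempfile.mkdtemp()
store = FileStateStore(root)
store.store_application("application_0001", {"user": "alice", "queue": "default"})

restarted = FileStateStore(root)
recovered = restarted.load_state()
```

      In this sketch only the stored metadata survives the restart; running work is not preserved, which mirrors the "non-work-preserving" scope named in the issue title.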

      Attachments

        1. MR-4343.1.patch
          17 kB
          Tsuyoshi Ozawa
        2. restart-12-11-zkstore.patch
          21 kB
          Bikas Saha
        3. restart-fs-store-11-17.patch
          17 kB
          Bikas Saha
        4. restart-zk-store-11-17.patch
          61 kB
          Bikas Saha
        5. RM-recovery-initial-thoughts.txt
          3 kB
          Bikas Saha
        6. RMRestartPhase1.pdf
          59 kB
          Bikas Saha
        7. YARN-128.full-code.3.patch
          176 kB
          Bikas Saha
        8. YARN-128.full-code.5.patch
          255 kB
          Bikas Saha
        9. YARN-128.full-code-4.patch
          179 kB
          Bikas Saha
        10. YARN-128.new-code-added.3.patch
          74 kB
          Bikas Saha
        11. YARN-128.new-code-added-4.patch
          78 kB
          Bikas Saha
        12. YARN-128.old-code-removed.3.patch
          123 kB
          Bikas Saha
        13. YARN-128.old-code-removed.4.patch
          123 kB
          Bikas Saha
        14. YARN-128.patch
          92 kB
          Devaraj Kavali

        Issue Links

        1. Remove old code for restart (Sub-task, Closed, Bikas Saha)
        2. Make changes for RM restart phase 1 (Sub-task, Closed, Bikas Saha)
        3. Add FS-based persistent store implementation for RMStateStore (Sub-task, Closed, Bikas Saha)
        4. Add FileSystem based store for RM (Sub-task, Resolved, Bikas Saha)
        5. Security related work for RM restart (Sub-task, Resolved, Bikas Saha)
        6. Add Zookeeper-based store implementation for RMStateStore (Sub-task, Closed, Karthik Kambatla)
        7. Add HDFS based store for RM which manages the store using directories (Sub-task, Resolved, Jian He)
        8. Create common proxy client for communicating with RM (Sub-task, Closed, Jian He)
        9. Delayed store operations should not result in RM unavailability for app submission (Sub-task, Closed, Zhijie Shen)
        10. AM max attempts is not checked when RM restart and try to recover attempts (Sub-task, Closed, Jian He)
        11. Race condition causing RM to potentially relaunch already unregistered AMs on RM restart (Sub-task, Closed, Jian He)
        12. NM should reject containers allocated by previous RM (Sub-task, Closed, Jian He)
        13. Test and verify that app delegation tokens are added to tokenRenewer after RM restart (Sub-task, Closed, Jian He)
        14. Restore appToken and clientToken for app attempt after RM restart (Sub-task, Closed, Jian He)
        15. Restore clientToken for app attempt after RM restart (Sub-task, Resolved, Jian He)
        16. Restore RMDelegationTokens after RM Restart (Sub-task, Closed, Jian He)
        17. RMStateStore's removeApplication APIs should just take an applicationId (Sub-task, Resolved, Tsuyoshi Ozawa)
        18. Slow or failing DelegationToken renewals on submission itself make RM unavailable (Sub-task, Closed, Omkar Vinit Joshi)
        19. verify that new jobs submitted with old RM delegation tokens after RM restart are accepted (Sub-task, Closed, Jian He)
        20. Store completed application information in RM state store (Sub-task, Closed, Jian He)
        21. RM crashes if it restarts while the state-store is down (Sub-task, Closed, Jian He)
        22. Apps Completed metrics on web UI is not correct after RM restart (Sub-task, Resolved, Jian He)
        23. List of applications at NM web UI is inconsistent with applications at RM UI after RM restart (Sub-task, Resolved, Jian He)
        24. Change FileSystemRMStateStore to use directories (Sub-task, Closed, Jian He)
        25. Document RM Restart feature (Sub-task, Closed, Jian He)
        26. "Active users" field in Resourcemanager scheduler UI gives negative values (Sub-task, Resolved, Unassigned)
        27. Handle app recovery differently for AM failures and RM restart (Sub-task, Resolved, Unassigned)
        28. Recovery issues on RM Restart with FileSystemRMStateStore (Sub-task, Resolved, Karthik Kambatla)
        29. Populate AMRMTokens back to AMRMTokenSecretManager after RM restarts (Sub-task, Closed, Jian He)
        30. RMStateStore should flush all pending store events before closing (Sub-task, Closed, Jian He)
        31. RM may relaunch already KILLED / FAILED jobs after RM restarts (Sub-task, Resolved, Jian He)
        32. AM fails to register if RM restarts within 5s of job submission (Sub-task, Resolved, Unassigned)
        33. During RM restart, RM should start a new attempt only when previous attempt exits for real (Sub-task, Closed, Omkar Vinit Joshi)
        34. Register ClientToken MasterKey in SecretManager after it is saved (Sub-task, Closed, Jian He)
        35. Save version information in the state store (Sub-task, Closed, Jian He)
        36. FileSystemRMStateStore can leave partial files that prevent subsequent recovery (Sub-task, Closed, Omkar Vinit Joshi)
        37. Rethink znode structure for RM HA (Sub-task, Closed, Tsuyoshi Ozawa)
        38. Batching optimization for ZKRMStateStore (Sub-task, Resolved, Tsuyoshi Ozawa)
        39. Implement a RMStateStore cleaner for deleting application/attempt info (Sub-task, Closed, Jian He)
        40. RM hangs on shutdown if calling system.exit in serviceInit or serviceStart (Sub-task, Closed, Jian He)
        41. Check time cost for recovering max-app-limit applications (Sub-task, Resolved, Jian He)
        42. Change killing application to wait until state store is done (Sub-task, Closed, Jian He)
        43. Apps should be saved after it's accepted by the scheduler (Sub-task, Open, Jian He)
        44. Fix invalid RMApp transition from NEW to FINAL_SAVING (Sub-task, Closed, Karthik Kambatla)
        45. Revisit RMApp transitions from NEW on RECOVER (Sub-task, Resolved, Unassigned)
        46. Excessive logging for app and attempts on RM recovery (Sub-task, Open, Unassigned)
        47. Work preserving recovery of Unmanaged AMs (Sub-task, Resolved, Subramaniam Krishnan)
        48. NPE on registerNodeManager if the request has containers for UnmanagedAMs (Sub-task, Closed, Karthik Kambatla)
        49. Job stays in PREP state for long time after RM Restarts (Sub-task, Closed, Jian He)
        50. Succeeded application remains in accepted after RM restart (Sub-task, Closed, Jian He)
        51. Better reporting of finished containers to AMs (Sub-task, Resolved, Unassigned)
        52. RM should honor NM heartbeat expiry after RM restart (Sub-task, Open, Unassigned)
        53. Move RM recovery related proto to yarn_server_resourcemanager_recovery.proto (Sub-task, Closed, Tsuyoshi Ozawa)
        54. Remove ApplicationAttemptState and ApplicationState class in RMStateStore class (Sub-task, Closed, Tsuyoshi Ozawa)
        55. Add leveldb-based implementation for RMStateStore (Sub-task, Closed, Jason Darrell Lowe)
        56. RMProxy should retry EOFException (Sub-task, Closed, Jian He)

          People

            Assignee: Unassigned
            Reporter: Arun Murthy (acmurthy)
            Votes: 1
            Watchers: 70
