  1. Hadoop YARN
  2. YARN-4497

RM might fail to restart when recovering apps whose attempts are missing

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-alpha1
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Found the following problem while discussing YARN-3480.

      If the RM fails to store some attempts in RMStateStore, those attempts will be missing from the store. For example, when storing attempt1, attempt2 and attempt3, the RM might successfully store attempt1 and attempt3 but fail to store attempt2. When the RM restarts, RMAppImpl#recover recovers the attempts one by one: in this case attempt1, then attempt2. When recovering attempt2, we call ((RMAppAttemptImpl)this.currentAttempt).recover(state), which first looks up the attempt's ApplicationAttemptStateData. The lookup finds nothing, so the assertion assert attemptState != null (RMAppAttemptImpl#recover, line 880) fails.
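
The failure mode described above can be reproduced in miniature with plain Java. This is only an illustrative sketch, not the actual RM code and not necessarily what the attached patches do: the class MissingAttemptRecoveryDemo, its methods, and the use of a Map in place of RMStateStore are all hypothetical stand-ins. recoverStrict mimics a recovery loop that assumes every attempt was stored; recoverTolerant shows one possible way to cope with the gap, namely skipping attempts that have no stored state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; none of these names come from Hadoop.
class MissingAttemptRecoveryDemo {

    // Stand-in for the state store: attempt id -> stored attempt state.
    // attempt2 is deliberately absent, as in the scenario in the description.
    static Map<Integer, String> sampleStore() {
        Map<Integer, String> store = new HashMap<>();
        store.put(1, "state-of-attempt1");
        store.put(3, "state-of-attempt3");
        return store;
    }

    // Recovers attempts 1..lastAttemptId one by one and assumes each one
    // has stored state, so a gap aborts the whole recovery.
    static List<Integer> recoverStrict(Map<Integer, String> store, int lastAttemptId) {
        List<Integer> recovered = new ArrayList<>();
        for (int id = 1; id <= lastAttemptId; id++) {
            String attemptState = store.get(id);
            if (attemptState == null) {
                // Plays the role of the failing "assert attemptState != null".
                throw new IllegalStateException("no stored state for attempt" + id);
            }
            recovered.add(id);
        }
        return recovered;
    }

    // Assumed alternative: skip attempts whose state is missing and keep going.
    static List<Integer> recoverTolerant(Map<Integer, String> store, int lastAttemptId) {
        List<Integer> recovered = new ArrayList<>();
        for (int id = 1; id <= lastAttemptId; id++) {
            if (store.get(id) == null) {
                continue; // attempt was never persisted; nothing to recover
            }
            recovered.add(id);
        }
        return recovered;
    }

    public static void main(String[] args) {
        Map<Integer, String> store = sampleStore();
        try {
            recoverStrict(store, 3);
        } catch (IllegalStateException e) {
            System.out.println("strict recovery failed: " + e.getMessage());
        }
        System.out.println("tolerant recovery kept attempts: " + recoverTolerant(store, 3));
    }
}
```

With the sample store, the strict loop recovers attempt1 and then stops at attempt2, while the tolerant loop recovers attempt1 and attempt3. Whether skipping is acceptable in the real RM depends on what other state refers to the missing attempt, which this sketch does not model.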

        Attachments

        1. YARN-4497.04.patch
          8 kB
          Jun Gong
        2. YARN-4497.03.patch
          8 kB
          Jun Gong
        3. YARN-4497.02.patch
          8 kB
          Jun Gong
        4. YARN-4497.01.patch
          6 kB
          Jun Gong

              People

              • Assignee:
                hex108 Jun Gong
                Reporter:
                hex108 Jun Gong