Hadoop YARN / YARN-5677

RM should transition to standby when connection is lost for an extended period

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.8.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha2
    • Component/s: resourcemanager
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      In trunk, I see no maximum number of retries for the ZooKeeper (ZK) connection. It appears the connection will be retried forever, so the active RM never figures out that it is no longer active. In my testing, the active-active state lasted almost 2 hours, with no sign of stopping, before I killed it. The solution appears to be to cap the number of retries or the amount of time spent retrying.

      This issue is significant because job submission is asynchronous. If the active doesn't know it's not active, it will buffer job submissions until it finally realizes it has become the standby, and then fail all of them in bulk. In high-volume workflows, that behavior can cause mass job failures.

      This issue is also important because the node managers will not fail over to the new active until the old active realizes it's the standby. Workloads submitted after the old active loses contact with ZK will therefore not be executed, regardless of which RM the clients contact.
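
      To make the proposed cap concrete, here is a minimal sketch of a reconnect loop bounded by both a retry count and a time budget. It is not the attached patch: the class name, the method name, and the parameters (BoundedReconnect, reconnectOrStandDown, maxRetries, maxWaitMillis, retryIntervalMillis) are all hypothetical, and the connection attempt is stood in for by a plain BooleanSupplier rather than any real ZooKeeper or RM API.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a capped reconnect loop; not the actual YARN-5677 change. */
final class BoundedReconnect {
    enum HaState { ACTIVE, STANDBY }

    /**
     * Retries tryConnect until it succeeds, maxRetries attempts have been made,
     * or maxWaitMillis has elapsed. Returns the HA state to assume afterwards.
     */
    static HaState reconnectOrStandDown(BooleanSupplier tryConnect,
                                        int maxRetries,
                                        long maxWaitMillis,
                                        long retryIntervalMillis) {
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (tryConnect.getAsBoolean()) {
                return HaState.ACTIVE;              // connection restored; stay active
            }
            if (System.nanoTime() - deadline >= 0) {
                break;                              // time budget exhausted
            }
            try {
                Thread.sleep(retryIntervalMillis);  // wait before the next attempt
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                break;                              // stop retrying when interrupted
            }
        }
        return HaState.STANDBY;                     // budget spent: stand down
    }
}
```

      With either bound in place, the loop always terminates, so an RM that cannot reach ZK ends up in standby instead of retrying indefinitely while believing it is still active.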

        Attachments

        1. YARN-5677.branch-2.001.patch
          12 kB
          Daniel Templeton
        2. YARN-5677.005.patch
          12 kB
          Daniel Templeton
        3. YARN-5677.004.patch
          12 kB
          Daniel Templeton
        4. YARN-5677.003.patch
          10 kB
          Daniel Templeton
        5. YARN-5677.002.patch
          4 kB
          Daniel Templeton
        6. YARN-5677.001.patch
          4 kB
          Daniel Templeton

            People

            • Assignee: Daniel Templeton (templedf)
            • Reporter: Daniel Templeton (templedf)
