  1. Hadoop YARN
  2. YARN-3639

RM takes too long to recover all apps if the original active RM and NN go down at the same time.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:
      None

      Description

      If the active RM and NN go down at the same time, the new RM takes a long time to recover all apps. After analysis, we found that the root cause is the renewal of HDFS tokens during recovery. The HDFS client created by the renewer first tries to connect to the original NN, which times out after 10~20s, and only then tries to connect to the new NN. In our test, the entire recovery took about 15 * #apps seconds.
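To make the reported cost concrete, here is a minimal back-of-the-envelope sketch of the arithmetic. It assumes that token renewal runs serially, one app at a time, and that each renewal pays one full connection timeout against the dead NN before failing over. The function names and the sample app counts are hypothetical and serve only as illustration; the 10~20s range and the 15s average come from the description.

```python
# Rough model of the recovery time described in this issue.
# Assumption: renewals run one after another and each one waits out a
# full connection timeout on the old NN before reaching the new NN.

def serial_recovery_seconds(num_apps: int, timeout_s: float = 15.0) -> float:
    """Estimated total recovery time when every app pays one timeout."""
    if num_apps < 0:
        raise ValueError("num_apps must be non-negative")
    return num_apps * timeout_s


def recovery_bounds_seconds(num_apps: int,
                            min_timeout_s: float = 10.0,
                            max_timeout_s: float = 20.0) -> tuple:
    """Lower and upper estimates using the observed 10~20s timeout range."""
    return (serial_recovery_seconds(num_apps, min_timeout_s),
            serial_recovery_seconds(num_apps, max_timeout_s))


if __name__ == "__main__":
    # Hypothetical app counts, chosen only to show how the cost scales.
    for apps in (100, 1000, 10000):
        low, high = recovery_bounds_seconds(apps)
        mid = serial_recovery_seconds(apps)
        print(f"{apps} apps: ~{mid / 60:.0f} min "
              f"(range {low / 60:.0f} to {high / 60:.0f} min)")
```

Under these assumptions the cost grows linearly with the number of apps, so a cluster with a few thousand apps to recover would spend hours in recovery.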

              People

              • Assignee:
                Unassigned
              • Reporter:
                Xianyin Xin (xinxianyin)
              • Votes:
                0
              • Watchers:
                12

                Dates

                • Created:
                  Updated:
                  Resolved: