
YARN-4401: A failed app recovery should not prevent the RM from starting

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 2.7.1
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels:
      None

      Description

      There are many different reasons why an app recovery could fail with an exception, and when that happens the RM start is aborted and the RM fails to come up. Presumably, the reason the RM is trying to do a recovery is that it's the standby trying to fill in for the active; failing to come up defeats the purpose of the HA configuration. Instead of preventing the RM from starting, a failed app recovery should log an error and skip the application.
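      A minimal sketch of the proposed behavior, using hypothetical types and method names rather than the actual RMAppManager code: recover each stored application individually and, when a single app's recovery throws, either abort startup (today's behavior) or log the error and skip that app.

          import java.util.List;
          import java.util.logging.Logger;

          public class RecoverySketch {
            private static final Logger LOG =
                Logger.getLogger(RecoverySketch.class.getName());

            /** Hypothetical per-application recovery step; may throw on bad stored state. */
            interface AppRecoverer {
              void recoverApplication(String appId) throws Exception;
            }

            static void recoverAll(List<String> storedAppIds, AppRecoverer recoverer,
                                   boolean skipFailedApps) throws Exception {
              for (String appId : storedAppIds) {
                try {
                  recoverer.recoverApplication(appId);
                } catch (Exception e) {
                  if (!skipFailedApps) {
                    throw e; // current behavior: a single bad app aborts RM startup
                  }
                  // proposed behavior: record the failure and keep bringing the RM up
                  LOG.severe("Skipping recovery of " + appId + ": " + e);
                }
              }
            }
          }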

      1. YARN-4401.001.patch
        2 kB
        Daniel Templeton


          Activity

          rohithsharma Rohith Sharma K S added a comment -

          In an ideal case, app recovery should not fail. If it does fail, then the fix should address the cause of the failure. Do you have any specific scenario in mind that is causing recovery to fail? I am open to being convinced.

          templedf Daniel Templeton added a comment -

          There are lots of reasons a recovery could fail. For example, if a job is stored with a resource allocation that is higher than the configured maximum at the time of recovery, the recovery will throw an exception which will prevent the RM from starting.

          In a single-RM configuration, it makes some sense to allow the RM restart to be interrupted by a recovery failure, but in an HA scenario the standby is becoming active precisely to prevent an outage. Causing an outage over one bad application undermines the point of HA. It becomes a question of trading an application failure for a service outage, and I think most sites would choose the former.

          There are already yarn.fail-fast and yarn.resourcemanager.fail-fast properties that control this behavior for some of the recovery failure scenarios, such as bad queue assignments. I would propose we extend the meaning of those properties to cover the full range of what could go wrong during recovery.
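          A rough sketch of how those flags might be consulted during recovery; the property names come from the comment above, but the fallback from yarn.resourcemanager.fail-fast to the global yarn.fail-fast shown here is an assumption about the intended semantics, not a statement of the actual implementation.

              import org.apache.hadoop.conf.Configuration;

              public class FailFastCheckSketch {
                /**
                 * Returns true if a recovery failure should abort RM startup.
                 * Assumed semantics for illustration: yarn.resourcemanager.fail-fast
                 * wins if set, otherwise fall back to the global yarn.fail-fast
                 * (defaulting to false, i.e. skip the bad app and keep starting).
                 */
                static boolean shouldFailFast(Configuration conf) {
                  boolean global = conf.getBoolean("yarn.fail-fast", false);
                  return conf.getBoolean("yarn.resourcemanager.fail-fast", global);
                }
              }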

          templedf Daniel Templeton added a comment -

          Here's the basic idea of what I'm proposing.

          rohithsharma Rohith Sharma K S added a comment -

          "if a job is stored with a resource allocation that is higher than the configured maximum at the time of recovery, the recovery will throw an exception which will prevent the RM from starting."

          Which version of Hadoop are you using? This issue is fixed in YARN-3493.

          And regarding the patch, the app should never be removed from the RMContext at any point during recovery; doing so causes an ApplicationNotFoundException on the client side, which is incorrect. In any case, to continue any of these flows, an appropriate event needs to be triggered so that the state transition completes.
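          One possible shape of that suggestion, sketched against the RM's existing dispatcher and app events; the choice of APP_REJECTED as the terminal event and the wiring shown are assumptions made for illustration, not the actual fix.

              import org.apache.hadoop.yarn.api.records.ApplicationId;
              import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
              import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEvent;
              import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType;

              public class FailedRecoveryEventSketch {
                /**
                 * Keep the app registered in RMContext so clients do not get an
                 * ApplicationNotFoundException, and dispatch an event so the app's
                 * state machine still reaches a terminal state. Using APP_REJECTED
                 * here is an assumption for illustration only.
                 */
                static void handleFailedRecovery(RMContext rmContext, ApplicationId appId) {
                  rmContext.getDispatcher().getEventHandler().handle(
                      new RMAppEvent(appId, RMAppEventType.APP_REJECTED));
                }
              }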

          sunilg Sunil G added a comment -

          Hi Daniel Templeton
          I am not very sure about the use case here. However, I feel that if such a case occurs, we will have enough information in the logs to get the app-id.
          Then we can use the command below to clear such apps, if necessary, rather than forcefully clearing them from the RMContext.

          Usage: yarn resourcemanager [-format-state-store]
                                      [-remove-application-from-state-store <appId>]
          
          templedf Daniel Templeton added a comment -

          I suppose I posed my proposal a little naively. Let's try again.

          The reason for configuring HA is to prevent an outage. It should be possible to tell the standby to come up regardless of recovery failures, in effect performing automatically the operation that Sunil G described or failing the bad app(s) or whatever.

          The app resource issue I offered was just the first example I (thought I) found while skimming the code. Rather than having to hunt down every possible way to throw an exception (checked or unchecked) during recovery, it would be convenient to have recovery catch any exception, log it, and do something sensible so that the RM can come up for cases where RM availability is a priority.

          templedf Daniel Templeton added a comment -

          This JIRA is superseded by YARN-6035, YARN-6036, and YARN-6037, which capture the same idea but more supportably.


            People

            • Assignee:
              templedf Daniel Templeton
              Reporter:
              templedf Daniel Templeton
            • Votes:
              0
              Watchers:
              9
