Hadoop YARN / YARN-1489

[Umbrella] Work-preserving ApplicationMaster restart

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Today if an AM goes down,

      • the RM kills all the containers of that ApplicationAttempt,
      • the new ApplicationAttempt doesn't know where the previous containers are running, and
      • the old running containers don't know where the new AM is running.

      We need to fix this to enable work-preserving AM restart. The latter two can potentially be handled at the app level, but it is good to have a common solution for all apps wherever possible.

      Attachments

      1. Work preserving AM restart.pdf (43 kB, Vinod Kumar Vavilapalli)

          Activity

          Bikas Saha added a comment -

          Would be good to see an overall design document, especially for the tricky pieces like reconnecting existing running containers to new app attempts.

          Vinod Kumar Vavilapalli added a comment -

          Would be good to see an overall design document..

          Yup, writing something up..

          Vinod Kumar Vavilapalli added a comment -

          Here's a crack at it.

          Zhijie Shen added a comment -

          Thanks Vinod for the proposal. One thought when I read the following point:

          In case of apps like MapReduce where containers need to communicate directly with AMs, the old running-containers don’t know where the new ApplicationMaster is running and how to reach it (service addresses).

          While the AM is restarting, containers in some applications may try to send messages to it, and those messages may get lost. Would it be good to buffer the outstanding messages and send them to the AM after rebinding?
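          A minimal container-side sketch of that buffering idea, assuming hypothetical AmClient and TaskUpdate types; this is not an existing YARN API, just an illustration of queueing while the AM is down and replaying after rebind:

```java
// Hypothetical container-side buffer for AM-bound messages. AmClient and
// TaskUpdate stand in for whatever app-level protocol is in use; none of
// this is an existing YARN API.
import java.util.ArrayDeque;
import java.util.Deque;

public class OutboundAmBuffer {
  private static final int MAX_BUFFERED = 1000; // bound memory during long restarts

  private final Deque<TaskUpdate> pending = new ArrayDeque<>();
  private AmClient am; // null while the AM is down

  public synchronized void send(TaskUpdate update) {
    if (am != null) {
      am.report(update);                 // normal path: AM reachable
    } else {
      if (pending.size() == MAX_BUFFERED) {
        pending.pollFirst();             // drop oldest rather than grow unbounded
      }
      pending.addLast(update);           // AM restarting: buffer the message
    }
  }

  /** Called once the container has rebound to the new AM attempt. */
  public synchronized void onRebind(AmClient newAm) {
    this.am = newAm;
    while (!pending.isEmpty()) {
      newAm.report(pending.pollFirst()); // replay outstanding messages in order
    }
  }

  interface AmClient { void report(TaskUpdate update); }
  static final class TaskUpdate { /* payload elided */ }
}
```

          Bounding the queue keeps container memory flat across a slow AM restart, at the cost of losing the oldest reports.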

          Bikas Saha added a comment -

          Here is an idea:
          The RM allows the app to send it some data during registration. This data could include the AM port information, etc. The RM could then sync this data with the NMs during NM heartbeat. The NMs already maintain per-app-attempt info, and this data would be added to that. The containers running on an NM could query for this attempt data and get the information about the new app attempt. This would be a scalable and efficient solution.
          The data per NM will be small since it would be size-checked and proportional to the number of app attempts. The NM could give access to an attempt's data only to the containers that belong to that attempt. Only local containers should be able to communicate with their NM for such information. This could be done via a local access token that is supplied by the NM whenever it launches a container.
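          To make the proposal concrete, here is a rough sketch of the two ends of that flow. None of these types are real YARN records; the names, the 1 KB bound, and the host:port payload format are all assumptions for illustration:

```java
// Illustrative only: the AM-side size check and the NM-side per-attempt
// store for the registration payload described above. None of these types
// are real YARN records; the 1 KB limit and host:port format are assumptions.
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AttemptPayloadStore {
  private static final int MAX_PAYLOAD_BYTES = 1024; // keep per-NM state small

  // NM side: app attempt id -> payload that the RM synced on heartbeat.
  private final Map<String, byte[]> payloads = new ConcurrentHashMap<>();

  /** AM side: validate the blob before attaching it to registration. */
  public static byte[] encode(String amHost, int amPort) {
    byte[] blob = (amHost + ":" + amPort).getBytes(StandardCharsets.UTF_8);
    if (blob.length > MAX_PAYLOAD_BYTES) {
      throw new IllegalArgumentException("attempt payload too large: " + blob.length);
    }
    return blob;
  }

  /** NM side: applied when an RM heartbeat response carries new attempt data. */
  public void onHeartbeatSync(String attemptId, byte[] payload) {
    payloads.put(attemptId, payload);
  }

  /** NM side: served only to local containers belonging to the same attempt. */
  public byte[] getForContainer(String attemptId) {
    return payloads.get(attemptId);
  }
}
```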

          Steve Loughran added a comment -

          Regarding the rebinding problem, YARN-913 proposes a registry where we restrict the names of services and apps and require uniqueness. This lets us register something like (hoya, stevel, accumulo5) and then let a client app look it up (a hypothetical sketch follows the list below).

          Today we have the list of running apps, and you can find and bind to one, but

          1. there's nothing to stop a single user having >1 instance of the same name
           2. there's no way for an AM to enumerate these, as the list operation isn't in the AMRM protocol
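          A hypothetical sketch of what such a registry could look like, with uniqueness enforced on (user, appType, instance) names; nothing like this interface exists in YARN at this point:

```java
// Hypothetical shape for the YARN-913 registry idea above. Keys are
// (user, appType, instance) triples, and registration enforces uniqueness,
// addressing problem 1 in the list.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ServiceRegistrySketch {
  // e.g. "stevel/hoya/accumulo5" -> service address
  private final Map<String, String> entries = new ConcurrentHashMap<>();

  /** Register a service; fails if the name is already taken by this user. */
  public void register(String user, String appType, String instance, String address) {
    String key = user + "/" + appType + "/" + instance;
    if (entries.putIfAbsent(key, address) != null) {
      throw new IllegalStateException("name already registered: " + key);
    }
  }

  /** Client-side lookup, e.g. lookup("stevel", "hoya", "accumulo5"). */
  public String lookup(String user, String appType, String instance) {
    return entries.get(user + "/" + appType + "/" + instance);
  }
}
```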
          Steve Loughran added a comment -

          Actually, the simplest way for an AM to work with a restarted cluster would be if there were a blocking operation to list active containers. At startup it could get that list and use it to initialize its data structures; on a first start the list would be empty.

          Alternatively, the restart information could be passed down in RegisterApplicationMasterResponse, which would avoid adding any new RPC calls.
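          A minimal AM-side sketch of that second option, assuming the register response exposes the preserved containers via a getContainersFromPreviousAttempts() accessor (the name later Hadoop releases settled on); the surrounding setup is the standard AMRMClient lifecycle:

```java
// AM-side sketch: recover still-running containers straight from the
// register response. getContainersFromPreviousAttempts() is assumed here;
// the rest is the standard AMRMClient lifecycle.
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;

public class RecoveringAm {
  public static void main(String[] args) throws Exception {
    AMRMClient<AMRMClient.ContainerRequest> rm = AMRMClient.createAMRMClient();
    rm.init(new Configuration());
    rm.start();

    RegisterApplicationMasterResponse resp =
        rm.registerApplicationMaster("am-host", 8030, ""); // host, RPC port, tracking URL

    // Empty on a first attempt; on an AM restart it lists the containers
    // the RM kept alive for this application.
    List<Container> previous = resp.getContainersFromPreviousAttempts();
    for (Container c : previous) {
      System.out.println("Re-adopting container " + c.getId() + " on " + c.getNodeId());
    }

    rm.stop(); // unregister and normal AM lifecycle elided for brevity
  }
}
```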

          Bikas Saha added a comment -

          The plan of record (POR) is for the attempt's AMRM register RPC to return the currently running containers for that app. So when the attempt makes the initial sync with the RM, it will get all that info.

          Bikas Saha added a comment -

          We need to come to a conclusion on how to allow the containers to also find out about the new AMs.
          Something we have discussed in the past:
          1) The new AM provides a payload to the RM upon registration.
          2) The RM syncs the payload with the NMs on heartbeat. The RM and NMs already sync on running-application state, so this payload could piggyback on that.
          3) A container on an NM could query the NM for its own AM's payload. This local API could be secured by a local token and made available only to containers running on the local node.
          4) This payload would be used by the containers to reconnect with the AM (in case systems don't use external solutions like ZooKeeper for such tracking).

          This sounds reasonably lightweight, scalable, and self-contained. All the interested parties would be informed within a 2*(NM heartbeat) time interval.
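          The container-side half of steps 3) and 4) might look like the sketch below. NmLocalClient and the token handling are hypothetical, since YARN exposes no such NM-local query today; this only illustrates the poll-and-rebind loop:

```java
// Container-side view of steps 3) and 4): poll the local NM for the current
// attempt's payload and rebind when it changes. NmLocalClient and the local
// access token are hypothetical, not existing YARN APIs.
public class AmRebinder implements Runnable {
  private final NmLocalClient nm;  // talks only to the local NodeManager
  private final String localToken; // issued by the NM at container launch
  private String knownAmAddress;

  public AmRebinder(NmLocalClient nm, String localToken) {
    this.nm = nm;
    this.localToken = localToken;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      // The NM has fresh data within ~2 * heartbeat of an AM restart.
      String amAddress = nm.getAttemptPayload(localToken);
      if (amAddress != null && !amAddress.equals(knownAmAddress)) {
        knownAmAddress = amAddress;
        reconnect(amAddress);      // rebind to the new AM attempt
      }
      try {
        Thread.sleep(3000L);       // poll at roughly NM-heartbeat granularity
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  private void reconnect(String amAddress) {
    System.out.println("Rebinding to AM at " + amAddress);
  }

  interface NmLocalClient { String getAttemptPayload(String localToken); }
}
```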

          Karthik Kambatla added a comment -

          Created a couple of sub-tasks based on an offline discussion with Anubhav, Bikas, Jian and Vinod.


            People

            • Assignee: Vinod Kumar Vavilapalli
            • Reporter: Vinod Kumar Vavilapalli
            • Votes: 0
            • Watchers: 22

              Dates

              • Created:
                Updated:

                Development