  Hadoop YARN / YARN-666

[Umbrella] Support rolling upgrades in YARN

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.4-alpha
    • Fix Version/s: 2.6.0
    • Component/s: graceful, rolling upgrade
    • Labels:
      None

      Description

      Jira to track changes required in YARN to allow rolling upgrades, including documentation and possible upgrade routes.

      Attachments

        1. YARN_Rolling_Upgrades_v2.pdf (73 kB, Siddharth Seth)
        2. YARN_Rolling_Upgrades.pdf (73 kB, Siddharth Seth)

        Issue Links

        1. Data persisted in RM should be versioned (Sub-task, Resolved, Junping Du)
        2. TokenIdentifier serialization should consider Unknown fields (Sub-task, Closed, Junping Du)
        3. ContainerTokens sent from the RM to NM via the AM should be a byte field (Sub-task, Resolved, Vinod Kumar Vavilapalli)
        4. Add an Exception to indicate 'Maintenance' for NMs and add this to the JavaDoc for appropriate protocols (Sub-task, Resolved, Unassigned)
        5. Add an interface on the RM to move NMs into a maintenance state (Sub-task, Resolved, Unassigned)
        6. Add an option to drain the ResourceManager of all apps for upgrades (Sub-task, Resolved, Arun C Murthy)
        7. ResourceManager and NodeManager should check for a minimum allowed version (Sub-task, Closed, Robert Parker)
        8. Data persisted in NM should be versioned (Sub-task, Closed, Junping Du)
        9. Move containerMgrProxy from RM's AMLaunch to get rid of issues that new client talking with old server (Sub-task, Open, Junping Du)
        10. YARN RPC should support Protocol Version in client/server (Sub-task, Resolved, Junping Du)
        11. Data persistent in timelinestore should be versioned (Sub-task, Closed, Junping Du)
        12. ApplicationHistoryStore should be versioned (Sub-task, Resolved, Junping Du)
        13. Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common (Sub-task, Closed, Junping Du)
        14. MR job client cannot reconnect to AM after NM restart (Sub-task, Closed, Junping Du)
        15. Provide Hadoop as a local resource (on HDFS) which can be used by other projects (Sub-task, Open, Junping Du)
        16. NMClient doesn't have retries for supporting rolling-upgrades (Sub-task, Resolved, Jian He)
        17. ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended fields (Sub-task, Closed, Junping Du)

          Activity

          hitesh Hitesh Shah added a comment -

          +1 to getting this built out. As they say, the devil is in the details.

          sseth Siddharth Seth added a comment -

          Initial writeup on changes that will be required, upgrade scenarios, some notes on compatible changes etc.

          kkambatl Karthik Kambatla (Inactive) added a comment -

          Sid - thanks for creating this. Excited.

          Just went over the design doc (which BTW is well-articulated) and have the following comments:

          1. Steps to upgrade a YARN cluster: do you think it would make sense to upgrade the NMs first, before upgrading the RM? If something goes wrong (hopefully not), users can fall back to the older version.
          2. Considerations (Upgrading the MR runtime): Until YARN/MR go into separate projects and release cycles, upgrading YARN alone (say 2.1.0 to 2.1.2) shouldn't affect the clients (MR) - no?
          3. Looks like we need to come up with an appropriate policy for "YARN data formats" in HADOOP-9517.
          4. I am assuming the version check will be similar to the one in HDFS-2983.
          5. Big +1 to "drain decommission"
          sseth Siddharth Seth added a comment -

          Steps to upgrade a YARN cluster: do you think it would make sense to upgrade the NMs first, before upgrading the RM? If something goes wrong (hopefully not), users can fall back to the older version.

          This really depends. There are situations which involve only an NM bug-fix; for such cases, the RM doesn't even need to be upgraded/restarted. It also depends on whether new APIs are being added to the RM which upgraded NMs may use.

          Considerations (Upgrading the MR runtime): Until YARN/MR go into separate projects and release cycles, upgrading YARN alone (say 2.1.0 to 2.1.2) shouldn't affect the clients (MR) - no?

          This depends on individual deployments. Sites may choose to deploy YARN/MR in a way where they can be upgraded independently. Take the same example: MR 2.1.2, which contains AM/MR runtime fixes, running against YARN 2.1.0. That's one of the main goals of MR being user-land code. Until work-preserving restart is implemented, there should be a way to upgrade MR without affecting the cluster.

          I am assuming the version check will be similar to the one in HDFS-2983.

          We can definitely learn from that - if we want to support more specific versions than just the ones on individual protocols. I don't think YARN has any version checks at the moment, other than the ones performed on API versions by the RPC layer.
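
          For illustration only, here is a minimal sketch of the kind of minimum-version check being discussed (similar in spirit to HDFS-2983); the class, constant, and method names below are made up and are not actual YARN code:

{code:java}
// Hypothetical sketch: reject a registering daemon whose software version is
// below the minimum this cluster is willing to talk to.
public final class MinimumVersionCheck {
  private static final int MIN_MAJOR = 2;
  private static final int MIN_MINOR = 1;

  /** Throws if the peer's version is older than the minimum allowed one. */
  public static void verify(int peerMajor, int peerMinor) {
    boolean tooOld = peerMajor < MIN_MAJOR
        || (peerMajor == MIN_MAJOR && peerMinor < MIN_MINOR);
    if (tooOld) {
      throw new IllegalStateException("Version " + peerMajor + "." + peerMinor
          + " is below the minimum allowed " + MIN_MAJOR + "." + MIN_MINOR);
    }
  }
}
{code}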

          sseth Siddharth Seth added a comment -

          Updated doc with a list of authors.

          curino Carlo Curino added a comment -

          This seems a very important problem (and a very hard one too).

          Just to toss one more idea around: I think that an HDFS-based shuffle (we are playing around with it, and performance is much better than expected) could simplify some of the problems, as we could piggyback on datanode decommissioning mechanics to migrate intermediate data out of a node being decommissioned. And (a bit obvious) preemption could be a good tool to make the "draining" fast without wasting work (the "administrative" scenarios we mentioned during the conversation in YARN-45).

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Just to toss one more idea around: I think that an HDFS-based shuffle (we are playing around with it, and performance is much better than expected)

          Carlo, it would be great if you could share some numbers.

          curino Carlo Curino added a comment -

          Hi Vinod, I will give you some numbers, but bear in mind that these results are very preliminary, based only on a handful of runs on a 9- or 10-machine cluster, and without serious tuning of terasort.

          The idea of the solution is for maps to write their output directly into HDFS (e.g., with replication turned down to 1). Reducers will be started only when maps complete and stream-merge straight out of HDFS (bypassing much of the partial merging logic).

          Key limitations of what we have for now:
          1) if a map output is lost, all reducers will have to wait for it to be re-run
          2) we have lots of DFSClients open, which might become a problem for HDFS if you have too many maps per node.

          We initially tried this as a way to make checkpointing cheaper (no need to save any state other than the last-processed key), and we were just hoping for it not to be too much worse than the regular shuffle. The surprise I mentioned above was that we actually observed a substantial speed-up on a simple sort job (on 9 nodes): 25% at 64GB scale and 31% at 1TB scale.

          This seems to indicate that the penalty of reading through HDFS is actually trumped by the benefits of doing a stream-merge (where data never touches disk on the reduce side, other than for the reducer output). This probably reduces seeks and uses the drives we read from and write to more efficiently. You could imagine getting similar benefits by adding restartability to the HTTP client (and the buffering done by the HDFS client, which was likely beneficial in our test). More sophisticated versions of this could also dynamically decide whether to stream-merge from a certain map or to copy the data (if, for example, it is small enough to fit in memory).

          Bottom line: I don't think we should read too much into these results (again, very preliminary), other than that using HDFS as the intermediate data layer is not completely infeasible.
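
          For illustration only, a rough sketch of the map side of this idea (writing shuffle output straight to HDFS with replication turned down to 1); the class name and path below are made up, and this is not the prototype code:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShuffleWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical location for one map task's output.
    Path mapOutput = new Path("/tmp/shuffle/job_0001/map_000042.out");
    // Single replica: intermediate data can be regenerated by re-running the map,
    // so there is no need to pay for 3x replication on the write path.
    try (FSDataOutputStream out = fs.create(mapOutput, (short) 1)) {
      out.writeBytes("partitioned, sorted map output would go here\n");
    }
    // A reducer would then open this file and stream-merge directly out of HDFS.
  }
}
{code}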

          lohit Lohit Vijayarenu added a comment -

          This looks good. A few minor points/JIRAs about updating metrics, reporting, and the UI pages to handle different versions of the YARN daemons should also be included. As Karthik already mentioned, it would be very useful if this followed HDFS-2983. This will be very useful for people who manage clusters and do rolling upgrades on them.

          Another question regarding draining of NodeManagers: do we have a concept of blacklisting a NodeManager today? The reason I ask is that if we know we can afford to kill the running apps on a NodeManager, but do not want new jobs to be scheduled on it, one could potentially use blacklisting.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Carlo Curino, thanks for the update - interesting stuff. I think we should pursue this route and do some experiments. It is much, much easier to do these experiments in 2.x given the YARN and MR separation. Maybe there's already a ticket for this. Will it be possible to put up your changes, however 'hacky' they might be?

          Lohit Vijayarenu, we have per-node health-check monitoring which blocks bad nodes. There isn't any other concept of blacklisting NMs today; that is the reason for the proposal to add a decommission mechanism.

          curino Carlo Curino added a comment -

          Vinod, I completely agree that the YARN/MR separation makes hacking around this much simpler.

          As soon as we are done polishing/publishing the rest of the checkpointing/preemption work, we will rebase this code and post what we have. Also, we are happy to socialize this, both the development and the experiments. For us this was a step towards cheaper checkpointing (as an HDFS-based shuffle is almost stateless for checkpoint purposes), but the performance wins are clearly interesting, and there are quite a few variations you can think of (e.g., a hybrid strategy using both streaming and localized data, etc. Fun stuff).

          By the way, some of the refactorings we propose in MAPREDUCE-5192 and MAPREDUCE-5194 are (aside from their use in checkpointing) useful towards this.

          sseth Siddharth Seth added a comment -

          TBD - handling of enum fields like AMCommand and NodeAction. This may be possible by forcing defaults if a new value needs to be added; alternately, define a new enum which is used by newer clients.
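
          For illustration only, one way the "force a default for unknown enum values" option could look; the enum, class, and method names below are made up and are not actual YARN APIs:

{code:java}
// Hypothetical compatibility shim for enum fields exchanged between mixed-version
// daemons during a rolling upgrade.
public final class EnumCompat {
  public enum Command { RESYNC, SHUTDOWN }

  /** Map a value received from a (possibly newer) peer onto something this build understands. */
  public static Command fromWire(String wireName, Command fallbackDefault) {
    try {
      return Command.valueOf(wireName);
    } catch (IllegalArgumentException unknownValue) {
      // A newer release added a value this version does not know about yet;
      // fall back to a safe default instead of failing the call.
      return fallbackDefault;
    }
  }
}
{code}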

          djp Junping Du added a comment -

          Linking two related JIRAs - work-preserving RM restart and NM restart.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          The 2.6.0 release has most of the functionality originally targeted in the design doc.

          Resolving this umbrella JIRA.

          • As new issues come in, we can open new tickets.
          • Will leave the open sub-tasks as they are.

          Thanks to everyone who contributed to this massive effort!

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Closing old tickets that are already part of a release.

          brahmareddy Brahma Reddy Battula added a comment - edited

          Sorry for coming in late. I did not see any documentation for this. I feel it would be good if the rolling upgrade/downgrade/rollback process were documented, as it is for HDFS.


            People

            • Assignee: Unassigned
            • Reporter: sseth Siddharth Seth
            • Votes: 0
            • Watchers: 33
