Hadoop YARN / YARN-896

Roll up for long-lived services in YARN

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      YARN is intended to be general purpose, but it is missing some features needed to truly support long-lived applications and long-lived containers.

      This ticket is intended to:

      1. discuss what is needed to support long-lived processes
      2. track the resulting JIRAs.

        Issue Links

          Activity

          Steve Loughran added a comment -

          Link to YARN-3418: AM to be able to update web and IPC bindings post-registration

          Steve Loughran added a comment -

          YARN-3417: AM to be able to exit with a request to be restarted.

          Vinod Kumar Vavilapalli added a comment -

          "I created JIRA YARN-3130 for excessive progress report logs in the application master. I think the log issue is very important for a long-running service in YARN. Should YARN-3130 be linked to this JIRA?"

          Thanks for filing YARN-3130. I am moving it to the MapReduce project. I think it is orthogonal to this JIRA, so we can keep it separate.

          Jian Fang added a comment -

          I created JIRA YARN-3130 for excessive progress report logs in the application master. I think the log issue is very important for a long-running service in YARN. Should YARN-3130 be linked to this JIRA?

          john lilley added a comment -

          Greetings! Arun pointed me to this JIRA to see if it could potentially meet our needs. We are an ISV that currently ships a data-quality/integration suite running as a native YARN application. We are finding several use cases that would benefit from being able to manage a per-node persistent service. MapReduce has its “shuffle auxiliary service”, but it isn’t straightforward to add auxiliary services because they cannot be loaded from HDFS, so we’d have to manage the distribution of JARs across nodes (please tell me if I’m wrong here…).

          This seems to be addressing a lot of the issues around persistent services, and frankly I'm out of my depth in this discussion. But if you all can help me understand whether this might help our situation, I'd be happy to have our team put shoulder to the wheel and help advance the development. Please comment on our contemplated use case and help me understand if this is the right place to be.

          Our software doesn't use MapReduce. It is a pure YARN application that is basically a peer to MapReduce. There are a lot of reasons for this decision, but the main one is that we have a large code base that already executes data transformations in a single-server environment, and we wanted to produce a product without rewriting huge swaths of code. Given that, our software takes care of many things usually delegated to MapReduce, including distributed sort/partition (i.e. "the shuffle"). However, MapReduce has a special place in the ecosystem, in that it creates an auxiliary service to handle the distribution of shuffle data to reducers. It doesn't look like third-party apps have an easy time installing aux services. The JARs for any such service must be in Hadoop's classpath on all nodes at startup, creating both a management issue and a trust/security issue. Currently our software places temporary data into HDFS for this purpose, but we've found that HDFS has a huge overhead in terms of performance and file handles, even at low replication. We desire to replace the use of HDFS with a lighter-weight service to manage temp files and distribute their data.
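
          To make the aux-service mechanics concrete, here is a minimal sketch of a per-node auxiliary service of the kind described above, assuming the Hadoop 2.x AuxiliaryService API; the class name TempDataService and the service name "temp_data" are illustrative, not real code from this issue. The class must be on the NM's classpath at startup and registered via yarn.nodemanager.aux-services (plus a matching yarn.nodemanager.aux-services.temp_data.class entry) in yarn-site.xml on every node, which is exactly the distribution problem raised above.

            import java.nio.ByteBuffer;
            import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
            import org.apache.hadoop.yarn.server.api.ApplicationTerminationContext;
            import org.apache.hadoop.yarn.server.api.AuxiliaryService;

            // Illustrative per-node service for managing temp files.
            public class TempDataService extends AuxiliaryService {

              public TempDataService() {
                super("temp_data"); // the name used in the yarn-site.xml keys
              }

              @Override
              public void initializeApplication(ApplicationInitializationContext ctx) {
                // set up per-application temp-file tracking here
              }

              @Override
              public void stopApplication(ApplicationTerminationContext ctx) {
                // delete the application's temp files here
              }

              @Override
              public ByteBuffer getMetaData() {
                // handed to AMs at container start, e.g. a serialized port number
                return ByteBuffer.allocate(0);
              }
            }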

          Steve Loughran added a comment -

          Link to YARN-1489: umbrella JIRA for work-preserving AM restart

          Steve Loughran added a comment -

          Link to YARN-1394: RM to inform NMs when a container completed due to planned/unplanned NM outage

          Steve Loughran added a comment -

          Link to YARN-810: CGroup limits for CPU

          Steve Loughran added a comment -

          Link to YARN-796: node labels

          Steve Loughran added a comment -

          Link to YARN-1200: RM to hold a central view of the rack topology that AMs can query & detect when it has changed

          Steve Loughran added a comment -

          Link to YARN-326: add more resource restrictions

          Steve Loughran added a comment -

          Linking to YARN-1160 (formerly MAPREDUCE-4277), an issue proposing that some users have the right to demand a container on a specific (live) node, giving them absolute control of the placement of specific services. This isn't for compute work, but rather for deploying monitoring and test services, or, as in YARN-1151, auxiliary services for the NN.

          Jason Lowe added a comment -

          "Chris, feel free to file a JIRA for rolling of stdout and stderr and we can look into what it will take to support that properly."

          Steve Loughran recently filed YARN-1104 as a subtask of this JIRA which covers the NM rolling stdout/stderr. We can transmute that JIRA into whatever ends up rolling the logs if it's not the NM.

          Robert Joseph Evans added a comment -

          I agree that providing a good way to handle stdout and stderr is important. I don't know if I want the NM to be doing this for us, though, but that is an implementation detail that we can talk about on the follow-up JIRA. Chris, feel free to file a JIRA for rolling of stdout and stderr and we can look into what it will take to support that properly.

          Steve Loughran added a comment -

          Added YARN-1111, an example of log-handling limits

          Steve Loughran added a comment -

          Added notification of the AM if resource requests can't be met: RM or YarnClient to notify AMs if resource requests cannot be satisfied

          Chris Riccomini added a comment -

          "Because many of these systems roll their logs to avoid filling up disks we will probably need a protocol of some sort for the container to communicate with the Node Manager when logs are ready to be processed."

          Somewhat related, and a random thought: Samza is currently piping stdout and stderr to files when it sets its AM and container commands (... 1>logs/stdout 2>logs/stderr). These files never get rolled, and they can get very big for long-lived processes (10s-100s of gigs). We encourage using a logging system (log4j) to handle this stuff, but there are cases where the recommended use is not followed (sigh).

          What do you guys think about tweaking the NM to consume the STDOUT/STDERR streams from the ProcessBuilder when it executes AM/container commands, writing them to stdout/stderr files, and rolling them periodically (e.g. daily)?
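
          A minimal sketch of that idea, assuming nothing about the NM's actual internals: a pump thread drains one of the child's streams into a file and rolls it once a size limit is hit (a daily roll would just swap the size check for a time check). The names StreamRoller and MAX_BYTES are invented for this sketch.

            import java.io.File;
            import java.io.FileOutputStream;
            import java.io.IOException;
            import java.io.InputStream;
            import java.io.OutputStream;

            // Drains a child process's stdout or stderr into a file, rolling
            // the file to "<name>.1" when it exceeds MAX_BYTES.
            public class StreamRoller implements Runnable {
              private static final long MAX_BYTES = 64L * 1024 * 1024;
              private final InputStream src;
              private final File base; // e.g. <container-log-dir>/stdout

              public StreamRoller(InputStream src, File base) {
                this.src = src;
                this.base = base;
              }

              @Override
              public void run() {
                try {
                  OutputStream out = new FileOutputStream(base);
                  long written = 0;
                  byte[] buf = new byte[8192];
                  int n;
                  while ((n = src.read(buf)) != -1) {
                    out.write(buf, 0, n);
                    written += n;
                    if (written >= MAX_BYTES) { // roll: stdout -> stdout.1
                      out.close();
                      File rolled = new File(base.getPath() + ".1");
                      rolled.delete();          // keep a single previous file
                      base.renameTo(rolled);
                      out = new FileOutputStream(base);
                      written = 0;
                    }
                  }
                  out.close();
                } catch (IOException e) {
                  // a real implementation would surface this to the NM's log
                }
              }
            }

            // Usage in a launcher: start the command without 1>/2> shell
            // redirection and pump both streams:
            //   Process p = new ProcessBuilder(cmd).start();
            //   new Thread(new StreamRoller(p.getInputStream(), new File(dir, "stdout"))).start();
            //   new Thread(new StreamRoller(p.getErrorStream(), new File(dir, "stderr"))).start();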

          Chris Riccomini added a comment -

          Steve Loughran I've linked the JIRAs as "relates to". The progress behavior you're describing is somewhat reasonable, but a bit unintuitive. It still feels like a hack. If that's the route we want to go, we should change the UI accordingly. If you think YARN-1079 is a dupe, feel free to close it and update YARN-1039 with UI notes.

          Regarding CGroup limits, have a look at YARN-810. It might be related to what you're saying.

          Steve Loughran added a comment -

          Chris, I use the bar today as a measure of expected nodes vs. actual, i.e. what percentage of the goal of work has been met. That is free to vary up and down with node failures; the percent bar is free to go in both directions.

          YARN-1039 already says "add a flag to say long-lived", so that future versions of YARN can behave differently. This could affect more than the GUI. In particular, the YARN-3 cgroup limits are something you may want to turn on for services, to limit their RAM & CPU to exactly what they asked for. If a long-lived service underestimates its requirements, the impact on the node is worse than if a short-lived container does so; for short-lived containers you may want to be more forgiving.

          Robert Joseph Evans added a comment -

          Chris Riccomini,

          That is a great point. To do this we need the application to somehow inform YARN that it is a long-lived application. We could do this either through some sort of metadata that is submitted with the application to YARN, possibly through the service registry, or perhaps even just by setting the progress to a special value like -1. I think I would prefer the first one, because then YARN could use that metadata later on for other things. After that, the UI change should not be too difficult. If you want to file a JIRA for it, either as a subtask or just linked in, that would be great.
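
          For what it's worth, later Hadoop releases did add submission-context metadata that could carry such a flag. A hedged client-side sketch, assuming the YarnClient API; the application type string and the "long-lived" tag are illustrative conventions, not values YARN defines:

            import java.util.Collections;
            import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
            import org.apache.hadoop.yarn.client.api.YarnClient;
            import org.apache.hadoop.yarn.client.api.YarnClientApplication;
            import org.apache.hadoop.yarn.conf.YarnConfiguration;

            public class SubmitLongLived {
              public static void main(String[] args) throws Exception {
                YarnClient client = YarnClient.createYarnClient();
                client.init(new YarnConfiguration());
                client.start();

                YarnClientApplication app = client.createApplication();
                ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
                ctx.setApplicationType("LONG_RUNNING_SERVICE");              // free-form type string
                ctx.setApplicationTags(Collections.singleton("long-lived")); // queryable metadata
                // ... set the AM container spec and resources, then:
                // client.submitApplication(ctx);
              }
            }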

          Chris Riccomini added a comment -

          Also, any idea what to do regarding long lived YARN processes (i.e. services that have no expected end) and the progress bar in YARN?

          Robert Joseph Evans added a comment -

          Sorry I have not responded sooner. I have been out on vacation and had a high-severity issue that consumed a lot of my time.

          Larry McCay and Thomas Weise: there are many different services that long-lived processes need to communicate with. Many of these services use tokens and others may not. Each of these tokens or other credentials is specific to the service being accessed. In some cases, like HBase, we can probably take advantage of the existing renewal feature in the RM. With other tokens or credentials it may be different, and may require AM-specific support. I am not really that concerned with solving the renewal problem for all possible credentials here, although if we can solve it for a lot of common tokens at the same time, that would be great. What I care most about is being sure that a long-lived YARN application does not have to stop and restart just because an HDFS token cannot be renewed any longer. If there are changes going into the HDFS security model that would make YARN-941 unnecessary, that is great. I have not had much time to follow the security discussion, so thank you for pointing this out. But it is also a question of time frames. YARN-941 and YARN-1041 would allow for secure, robust, long-lived applications on YARN, and do not appear to be that difficult to accomplish. Do you know the time frame for the security rework?

          Steve Loughran added a comment -

          Services can benefit from the YARN-679 entry point, primarily by having a standard way to deploy YARN service class implementations with failure and signal handling built in.

          Larry McCay added a comment -

          While I am missing some of the important context of how tokens are issued for these long-lived containers, I can introduce another pattern for token use that may be of some interest.

          If, when an application is submitted to the RM, it included tokens that represent the application's identity and have a sufficiently long expiration date, then they could be exchanged for shorter-lived access tokens. Upon completion, or upon being flagged as rogue, the identity token can be revoked/invalidated, at which point the bearer can no longer acquire access tokens with it. This pattern eliminates the finite-lifespan issue that tokens such as the delegation token have, and at the same time reduces the amount of damage that can be done with an access token. This pattern is being discussed as part of the Hadoop SSO efforts for user authentication, which you can find at HADOOP-9533 and HADOOP-9392. I have also filed a JIRA and have a preliminary patch posted for a JsonWebToken for use in such a pattern: HADOOP-9781. It uses PKI-based cryptography for signing and verifying the token, supported by a dependency on HADOOP-9534 for a credential management framework.
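
          A purely hypothetical sketch of that exchange pattern, just to make the lifetimes and the revocation point concrete; none of these types exist in Hadoop, and the five-minute access-token lifetime is an arbitrary example:

            import java.time.Instant;
            import java.util.HashSet;
            import java.util.Set;

            // Hypothetical types; not Hadoop APIs.
            final class IdentityToken {
              final String appId;
              final Instant expiry; // long but still finite lifetime
              IdentityToken(String appId, Instant expiry) { this.appId = appId; this.expiry = expiry; }
            }

            final class AccessToken {
              final String service;
              final Instant expiry; // short lifetime: minutes, not days
              AccessToken(String service, Instant expiry) { this.service = service; this.expiry = expiry; }
            }

            final class TokenExchangeService {
              private final Set<String> revoked = new HashSet<>();

              // Exchange a valid identity token for a short-lived access token.
              AccessToken exchange(IdentityToken id, String service) {
                if (revoked.contains(id.appId) || Instant.now().isAfter(id.expiry)) {
                  throw new SecurityException("identity token revoked or expired");
                }
                return new AccessToken(service, Instant.now().plusSeconds(300));
              }

              // After revocation the bearer can no longer mint access tokens,
              // which bounds the damage a leaked credential can do.
              void revoke(IdentityToken id) { revoked.add(id.appId); }
            }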

          Steve Loughran added a comment -

          YARN-1011 (speculative containers) may be useful here too: you could have some speculative containers that come and go alongside a set of static containers with longer lifespans.

          Siddharth Seth added a comment -

          "Robert Joseph Evans Applications may connect to other services such as HBase or issue tokens for communication between its own containers. All of these would require renewal."

          The RM takes care of renewing tokens for HDFS; it can do this because the HDFS token renewer class is on the RM's classpath. For other applications, Hive for example, this isn't possible. I believe Hive ends up issuing tokens that are valid for a longer duration to get around the renewal problem. I won't necessarily link this to long-running YARN, though, other than the bit about the token max-age.

          Thomas Weise added a comment -

          Robert Joseph Evans: applications may connect to other services such as HBase, or issue tokens for communication between their own containers. All of these would require renewal.

          Robert Joseph Evans added a comment -

          I filed one new JIRA for updating tokens in the RM: YARN-941.

          I started to file a JIRA for the AM to be informed of the location of its already-running containers, but as I was writing it I realized that it would not give us enough information to be able to reattach to the containers. The only thing it would give us is enough info to go shoot the containers, simply because there is no metadata about what port the container may be listening on or anything like that. It seems to me that we would be better off keeping a log, similar to the MR job history log, that contains all the data the AM needs to look for running containers. If others see a different need for this API, I am still happy to file a JIRA for it.
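
          A hedged illustration of that idea (no such log exists in YARN, and the record layout is invented here): the AM appends one record per launched container to a file in HDFS, so a restarted AM can rediscover its containers.

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FSDataOutputStream;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;

            public class ContainerLog {
              private final FSDataOutputStream out;

              public ContainerLog(Configuration conf, Path logPath) throws Exception {
                FileSystem fs = FileSystem.get(conf);
                // append() requires append support on older clusters
                this.out = fs.exists(logPath) ? fs.append(logPath) : fs.create(logPath);
              }

              // One tab-separated record per container, including the service
              // port that the plain YARN container report does not carry.
              public synchronized void record(String containerId, String host,
                                              int servicePort) throws Exception {
                out.writeBytes(containerId + "\t" + host + "\t" + servicePort + "\n");
                out.hflush(); // make the record durable for a restarted AM
              }
            }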

          I have not filed a JIRA for anti-affinity yet either. I seem to remember another JIRA for something like this already, but I have not found it yet. I figure we can add a long-lived-process flag for the scheduler when we run across a use case for it.

          The other parts discussed here either already have a JIRA associated with the same functionality, or I think need a bit more discussion about exactly what we want to do, namely log aggregation/processing and Hadoop "package" management/rolling upgrades of live applications.

          If I missed something please let me know.

          Robert Joseph Evans added a comment -

          Thomas Weise I am not totally sure what you mean by app-specific tokens. Are these tokens that the app is going to use to connect to other services like HBase, or something else?

          eric baldeschwieler and Enis Soztutar: rolling upgrades are a very interesting use case. We can definitely add a ticket to support this type of thing. I agree that it needs to be thought through some, and it is going to require help from both the AM and YARN to do it properly.

          eric baldeschwieler added a comment -

          IMO, you should be able to run a new framework / service simply by dropping a tarball / jar / war sort of thing into a well-known place and pointing to it in your job invocation.

          I'm not sure what, beyond this and the distributed cache, Hoya would need to deploy HBase, but it would be great to get it to the point where you simply drop either just a Hoya package (that contains a version of HBase), or Hoya and an HBase tarball, into HDFS.

          Let's discuss and make a proposal.

          Thomas Weise added a comment -

          We also identified the need for token renewal (app specific tokens). This should be a common need for long running services. Has it been discussed elsewhere?

          Enis Soztutar added a comment -

          I think we also need a generic provisioning service from YARN as well. We talked briefly about this with Arun. From the Hoya perspective, we would like to be able to use YARN to deploy multiple different versions of HBase in the same cluster, with different configurations, and possibly have the Hoya app master do a rolling restart to update running HBase clusters. For this, what we need is tarball and configuration distribution (instead of the regular rpm install). Alternatively, we could also build this inside the Ambari provisioner and wrap those APIs in YARN, so that Ambari deploys YARN, and YARN uses Ambari services to deploy other bits.
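
          For the tarball half of this, the distributed cache can already do the distribution: register the tarball in HDFS as an ARCHIVE-type LocalResource and the NM downloads and unpacks it for each container. A minimal sketch with an illustrative path (configuration distribution and rolling restart are not covered here):

            import java.util.HashMap;
            import java.util.Map;
            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileStatus;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.yarn.api.records.LocalResource;
            import org.apache.hadoop.yarn.api.records.LocalResourceType;
            import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
            import org.apache.hadoop.yarn.util.ConverterUtils;

            public class HBaseTarball {
              public static Map<String, LocalResource> localResources(Configuration conf)
                  throws Exception {
                Path tarball = new Path("hdfs:///apps/hoya/hbase.tar.gz"); // illustrative path
                FileStatus st = FileSystem.get(conf).getFileStatus(tarball);
                LocalResource res = LocalResource.newInstance(
                    ConverterUtils.getYarnUrlFromPath(tarball),
                    LocalResourceType.ARCHIVE,            // the NM unpacks the tarball
                    LocalResourceVisibility.APPLICATION,
                    st.getLen(), st.getModificationTime());
                Map<String, LocalResource> resources = new HashMap<>();
                resources.put("hbase", res); // unpacked under ./hbase in each container
                return resources;            // goes into the ContainerLaunchContext
              }
            }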

          Thomas Weise added a comment -

          Bobby, thanks for putting this together. Some items from the DataTorrent wish list (most already covered above):

          • gang scheduling (similar to YARN-624)
          • affinity, anti-affinity
          • return resource requests that cannot be met
          • attach restarted AM to existing containers
          • service registry
          Steve Loughran added a comment -

          Added YARN-913, the need for a service registry. I'm abusing the RM for this today, but there's a small race condition between a client checking for a running instance of a given name and launching an instance of that name. I'd like something race-condition-proof and able to hold more information about a service.
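
          One race-proof shape for this, sketched with Apache Curator against ZooKeeper (an illustration only, not the eventual YARN-913 design): registration is an atomic znode create, so the check-then-launch race collapses into a single operation that exactly one instance can win.

            import org.apache.curator.framework.CuratorFramework;
            import org.apache.curator.framework.CuratorFrameworkFactory;
            import org.apache.curator.retry.ExponentialBackoffRetry;
            import org.apache.zookeeper.CreateMode;
            import org.apache.zookeeper.KeeperException;

            public class ServiceRegistration {
              // Returns true if this instance now owns the service name. The
              // caller must keep the client open: closing it (or dying)
              // removes the ephemeral node and frees the name.
              public static boolean register(CuratorFramework client, String name,
                                             byte[] record) throws Exception {
                try {
                  client.create()
                        .creatingParentsIfNeeded()
                        .withMode(CreateMode.EPHEMERAL)
                        .forPath("/services/" + name, record);
                  return true;
                } catch (KeeperException.NodeExistsException e) {
                  return false; // another instance is already registered
                }
              }

              public static void main(String[] args) throws Exception {
                CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181", new ExponentialBackoffRetry(1000, 3));
                client.start();
                System.out.println(register(client, "hbase1", new byte[0]));
              }
            }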

          Robert Joseph Evans added a comment -

          Chris, yes, I missed the app master retry issue. Those two, with the discussion on them, seem to cover what we are looking for.

          Chris Riccomini added a comment -

          This list sounds pretty good. One other thing to add would be the YARN-611 and YARN-614 tickets, which are useful for long-lived processes.

          Robert Joseph Evans added a comment -

          No comments in the past few days. I would like to hear from more people involved, even if it is just to say that it looks like we have everything covered here. Then we can start filing JIRAs and getting some work done.

          Steve Loughran added a comment -

          Based on our Hoya (HBase on YARN) work:

          • we need a restarted AM to be given the existing set of containers from its previous instance. The use case there is that region servers should stay up while the AM and master are restarted.
          • maybe: be able to warn YARN that the services will be long-lived. That could be used in scheduling and placement.
          • anti-affinity is needed to declare that different container instances SHOULD be deployed on different nodes (use case: region servers); see the placement sketch after this comment. If failure domains are supported in the topology, anti-affinity should use them. I don't know if we'd want best-effort vs. absolute requirements.
          • add the ability to increase the requirements of running containers, e.g. say "this service is using more RAM than expected, reduce the amount available to others".
          • maybe: ability to send kill signals to container processes, to do a graceful kill before escalating. This is of limited value if an extra process (such as bin/hbase) intervenes in the startup process.

          There's also long-lived service discovery, a topic for another JIRA.
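
          Anti-affinity of this kind eventually landed, much later than this discussion, as the placement-constraints API in Hadoop 3.1 (the YARN-6592 work). A hedged sketch of expressing region-server anti-affinity with it; the allocation tag "hbase-rs" is an illustrative name, and the constraint would be attached to a SchedulingRequest carrying that same tag:

            import org.apache.hadoop.yarn.api.resource.PlacementConstraint;
            import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.NODE;
            import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.targetNotIn;
            import static org.apache.hadoop.yarn.api.resource.PlacementConstraints.PlacementTargets.allocationTag;

            public class RegionServerPlacement {
              // "Do not place a container tagged hbase-rs on a node that
              // already has one."
              public static PlacementConstraint antiAffinity() {
                return targetNotIn(NODE, allocationTag("hbase-rs")).build();
              }
            }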

          Robert Joseph Evans added a comment -

          Another issue that has been discussed in the past is the impact that long lived processes can have on resource scheduling. It is possible for a long lived process to grab lots of resources and then never release them even though it is using more resources than it would be allowed to have when the cluster is full. Recent preemption changes should be able to prevent this from happening between different queues/pools, but we may need to think if we need more control about this within a queue.

          Robert Joseph Evans added a comment -

          During the most recent Hadoop Summit there was a developer meetup where we discussed some of these issues. This is to summarize what was discussed at that meeting and to add in a few things that have also been discussed on mailing lists and other places.

          HDFS delegation tokens have a maximum life time. Currently tokens submitted to the RM when the app master is launched will be renewed by the RM until the application finishes and the logs from the application have finished aggregating. The only token currently used by the YARN framework is the HDFS delegation token. This is used to read files from HDFS as part of the distributed cache and to write the aggregated logs out to HDFS.

          In order to support relaunching an app master after the maximum lifetime of the HDFS delegation token has passed, we either need to allow for tokens that do not expire or provide an API to allow the RM to replace the old token with a new one. Because removing the maximum lifetime of a token reduces the security of the cluster as a whole, I think it would be better to provide an API to replace the token with a new one.

          If we want to continue supporting log aggregation we also need to provide a way for the Node Managers to get the new token too. It is assumed that each app master will also provide an API to get the new token so it can start using it.
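
          For context, a minimal sketch of the mechanics under discussion, using the public Credentials/Token APIs: the client collects HDFS delegation tokens naming the RM as renewer and ships them with the application, and the renewer calls renew() until the token's maximum lifetime, which is the hard limit at issue here. The "yarn" renewer string is a stand-in for the configured RM principal.

            import org.apache.hadoop.conf.Configuration;
            import org.apache.hadoop.fs.FileSystem;
            import org.apache.hadoop.security.Credentials;
            import org.apache.hadoop.security.token.Token;

            public class TokenSetup {
              // Client side: collect HDFS delegation tokens naming the RM as
              // renewer; shipped with the application at submit time.
              public static Credentials collect(Configuration conf) throws Exception {
                Credentials creds = new Credentials();
                FileSystem.get(conf).addDelegationTokens("yarn", creds);
                return creds;
              }

              // Renewer side: renew() extends the expiry, but fails once the
              // token's maximum lifetime is reached -- the hard limit at issue.
              public static void renewAll(Credentials creds, Configuration conf)
                  throws Exception {
                for (Token<?> t : creds.getAllTokens()) {
                  long newExpiry = t.renew(conf);
                  System.out.println(t.getKind() + " renewed until " + newExpiry);
                }
              }
            }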

          Log aggregation is another issue, although not required for long lived applications to work. Logs are aggregated into HDFS when the application finishes. This is not really that useful for applications that are never intended to exit. Ideally the processing of logs by the node manager should be pluggable so that clusters and applications can select how and when logs are processed/displayed to the end user. Because many of these systems roll their logs to avoid filling up disks we will probably need a protocol of some sort for the container to communicate with the Node Manager when logs are ready to be processed.

          Another issue is to allow containers to outlive the app master that launched them, and also to allow containers to outlive the node manager that launched them. This is especially critical for the stability of applications during rolling upgrades to YARN.


            People

            • Assignee: Unassigned
            • Reporter: Robert Joseph Evans
            • Votes: 8
            • Watchers: 114