During the most recent Hadoop Summit there was a developer meetup where we discussed some of these issues. This is to summarize what was discussed at that meeting and to add in a few things that have also been discussed on mailing lists and other places.
HDFS delegation tokens have a maximum lifetime. Currently, tokens submitted to the RM when the app master is launched are renewed by the RM until the application finishes and its logs have finished aggregating. The only token currently used by the YARN framework is the HDFS delegation token. It is used to read files from HDFS as part of the distributed cache and to write the aggregated logs out to HDFS.
In order to support relaunching an app master after the maximum lifetime of the HDFS delegation token has passed, we either need to allow for tokens that do not expire or provide an API to allow the RM to replace the old token with a new one. Because removing the maximum lifetime of a token reduces the security of the cluster as a whole, I think it would be better to provide an API to replace the token with a new one.
If we want to continue supporting log aggregation, we also need to provide a way for the node managers to get the new token. It is assumed that each app master will also provide an API to get the new token so it can start using it.
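To make the token replacement idea concrete, here is a minimal sketch of what such an API might look like. It is only an illustration: `SketchToken`, `TokenRegistry`, `replaceToken` and `currentToken` are hypothetical names invented for this example and do not exist in YARN or HDFS. The assumption is that the RM keeps one current token per application, that a client can swap it for a fresh one, and that node managers or a relaunched app master can look up whichever token is current.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only; none of these types are part of YARN or HDFS. */
class TokenReplacementSketch {

    /** Stand-in for a delegation token: an opaque id plus the end of its maximum lifetime. */
    public static final class SketchToken {
        public final String id;
        public final long maxLifetimeEndMillis;

        public SketchToken(String id, long maxLifetimeEndMillis) {
            this.id = id;
            this.maxLifetimeEndMillis = maxLifetimeEndMillis;
        }

        public boolean isExpired(long nowMillis) {
            return nowMillis >= maxLifetimeEndMillis;
        }
    }

    /** Assumed to live in the RM: tracks the current token for each application. */
    public static final class TokenRegistry {
        private final Map<String, SketchToken> current = new ConcurrentHashMap<>();

        /** Replaces the token for an application; rejects a token that is already past its lifetime. */
        public void replaceToken(String appId, SketchToken fresh, long nowMillis) {
            if (fresh.isExpired(nowMillis)) {
                throw new IllegalArgumentException("replacement token is already expired: " + fresh.id);
            }
            current.put(appId, fresh);
        }

        /** Lookup that node managers or a relaunched app master could use to fetch the current token. */
        public Optional<SketchToken> currentToken(String appId) {
            return Optional.ofNullable(current.get(appId));
        }
    }
}
```

The design choice illustrated here is that tokens keep their maximum lifetime and the old token is simply superseded, rather than made non-expiring. How the new token is authenticated and distributed securely is left out of the sketch.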
Log aggregation is another issue, although it is not required for long-lived applications to work. Logs are aggregated into HDFS when the application finishes, which is not really useful for applications that are never intended to exit. Ideally, the processing of logs by the node manager should be pluggable so that clusters and applications can select how and when logs are processed and displayed to the end user. Because many of these systems roll their logs to avoid filling up disks, we will probably need a protocol of some sort for the container to tell the node manager when logs are ready to be processed.
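As a rough illustration of the pluggable log processing described above, the following sketch shows one possible shape for the plug-in point and for the notification a container might send when a rolled log is ready. Everything here is an assumption made for the example: `LogHandler`, `RecordingLogHandler` and `NodeManagerLogDispatcher` are hypothetical names, not existing YARN classes, and the transport between the container and the node manager is not modeled.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch only; none of these types are part of YARN. */
class PluggableLogSketch {

    /** Plug-in point: a cluster or application would choose an implementation. */
    public interface LogHandler {
        /** Called when a container reports that a rolled log file is complete. */
        void logReady(String containerId, Path rolledLog);

        /** Called when the container exits, so any remaining logs can be handled. */
        void containerFinished(String containerId);
    }

    /** Trivial handler that only records events; a real one might upload or index the file. */
    public static final class RecordingLogHandler implements LogHandler {
        private final List<String> events = new ArrayList<>();

        @Override
        public void logReady(String containerId, Path rolledLog) {
            events.add("ready:" + containerId + ":" + rolledLog.getFileName());
        }

        @Override
        public void containerFinished(String containerId) {
            events.add("finished:" + containerId);
        }

        public List<String> events() {
            return Collections.unmodifiableList(new ArrayList<>(events));
        }
    }

    /** Assumed node manager side: forwards container notifications to the configured handler. */
    public static final class NodeManagerLogDispatcher {
        private final LogHandler handler;

        public NodeManagerLogDispatcher(LogHandler handler) {
            this.handler = handler;
        }

        public void notifyLogReady(String containerId, String fileName) {
            handler.logReady(containerId, Paths.get(fileName));
        }

        public void notifyContainerFinished(String containerId) {
            handler.containerFinished(containerId);
        }
    }
}
```

With this shape, the existing aggregate-on-finish behavior could be one handler among several, while a long-lived service could pick a handler that processes each rolled file as soon as it is reported.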
Another issue is to allow containers to outlive the app master that launched them, and also to outlive the node manager that launched them. This is especially critical for the stability of applications during rolling upgrades of YARN.