Chris Trezzo, Gera Shegalov, and I discussed this further. We would like to share an update and get feedback from others. Similar to what Robert originally suggested, we need to provide a way for the AM to update the log aggregation policy when it stops a container.
One likely log aggregation policy for MRAppMaster is to aggregate logs for all failed tasks and for a sample of successful tasks. What we found is that the container exit code isn't a reliable indication of whether an MR task finished successfully. That is because MRAppMaster calls stopContainer while the YarnChild JVM is exiting on its own. Depending on the timing, a successful task can end up with a non-zero exit code. So specifying the log aggregation policy up front in ContainerLaunchContext isn't enough.
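To make the "all failed tasks plus a sample of successful tasks" idea concrete, here is a minimal sketch of how the AM could decide from its own record of the task rather than from the container exit code. The names TaskOutcome and SampledLogPolicy are hypothetical and are not part of YARN or MapReduce.

```java
import java.util.Random;

// All names here are hypothetical; they are not YARN or MapReduce APIs.
enum TaskOutcome { SUCCEEDED, FAILED }

final class SampledLogPolicy {
    private final double successSampleRate; // fraction of successful tasks to keep, in [0, 1]
    private final Random random;

    SampledLogPolicy(double successSampleRate, Random random) {
        if (successSampleRate < 0.0 || successSampleRate > 1.0) {
            throw new IllegalArgumentException("sample rate must be in [0, 1]");
        }
        this.successSampleRate = successSampleRate;
        this.random = random;
    }

    // Decide from the AM's own view of the task, not from the container exit code.
    boolean shouldAggregate(TaskOutcome outcome) {
        if (outcome == TaskOutcome.FAILED) {
            return true; // always keep logs of failed tasks
        }
        return random.nextDouble() < successSampleRate; // sample successful tasks
    }
}
```

Because the decision depends on information only the AM has at task completion, it cannot be fully expressed at container launch time.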
The mechanism for the AM to pass the log aggregation policy to YARN needs to cover three scenarios:
1. Containers exit by themselves. DistributedShell belongs to this category.
2. AM has to explicitly stop the containers. MR belongs to this category.
3. AM might want to ask NM to do on-demand log aggregation without stopping the container. This might be useful for some long-running applications.
To support #1, we have to specify the log aggregation policy as part of the startContainer call. Chris' patch handles that.
To support #2, the AM has to tell the NM during the stopContainer call whether log aggregation is needed. The AM can use different kinds of policies, such as sampling successful tasks. For that, the AM will specify the log aggregation policy as part of StopContainerRequest:
/**
 * Get the <code>ContainerLogAggregationPolicy</code> for the container.
 * @return The <code>ContainerLogAggregationPolicy</code> for the container.
 */
public ContainerLogAggregationPolicy getLogAggregationPolicy();

/**
 * Set the <code>ContainerLogAggregationPolicy</code> for the container.
 * @param policy The <code>ContainerLogAggregationPolicy</code> for the container.
 */
public void setLogAggregationPolicy(ContainerLogAggregationPolicy policy);
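As a rough illustration of how an AM might use such a setter, the sketch below uses simplified stand-in types. The enum values AGGREGATE and DO_NOT_AGGREGATE, the single-container request shape, and the AmStopHelper class are all assumptions for illustration; the actual records in the patch may look quite different.

```java
// Stand-in types for illustration only; the real YARN records are abstract
// classes built through factories, and the policy values below are assumed.
enum ContainerLogAggregationPolicy { AGGREGATE, DO_NOT_AGGREGATE }

final class StopContainerRequest {
    private final String containerId;
    private ContainerLogAggregationPolicy policy;

    StopContainerRequest(String containerId) {
        this.containerId = containerId;
    }

    String getContainerId() { return containerId; }

    ContainerLogAggregationPolicy getLogAggregationPolicy() { return policy; }

    void setLogAggregationPolicy(ContainerLogAggregationPolicy policy) {
        this.policy = policy;
    }
}

// Hypothetical AM-side helper.
final class AmStopHelper {
    // The AM knows whether the task failed and whether it was picked for sampling.
    static StopContainerRequest buildStopRequest(
            String containerId, boolean taskFailed, boolean sampled) {
        StopContainerRequest request = new StopContainerRequest(containerId);
        request.setLogAggregationPolicy(taskFailed || sampled
                ? ContainerLogAggregationPolicy.AGGREGATE
                : ContainerLogAggregationPolicy.DO_NOT_AGGREGATE);
        return request;
    }
}
```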
Alternatively, we can define a new record called ContainerStopContext to carry the log aggregation policy along with any other stop-time information we want to add later. StopContainerRequest would then expose:
public abstract ContainerStopContext getContainerStopContext();
public abstract void setContainerStopContext(ContainerStopContext context);
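One possible shape for this alternative is sketched below. Only the name ContainerStopContext comes from the proposal; the enum values, SimpleContainerStopContext, and NmPolicyResolver are hypothetical. The rule that a stop-time policy overrides the launch-time one is also an assumption, not something that has been decided here.

```java
// Assumed policy values, for illustration only.
enum ContainerLogAggregationPolicy { AGGREGATE, DO_NOT_AGGREGATE }

// Hypothetical record; only the name ContainerStopContext comes from the proposal.
abstract class ContainerStopContext {
    abstract ContainerLogAggregationPolicy getLogAggregationPolicy();
    abstract void setLogAggregationPolicy(ContainerLogAggregationPolicy policy);
}

final class SimpleContainerStopContext extends ContainerStopContext {
    private ContainerLogAggregationPolicy policy;

    @Override
    ContainerLogAggregationPolicy getLogAggregationPolicy() { return policy; }

    @Override
    void setLogAggregationPolicy(ContainerLogAggregationPolicy policy) {
        this.policy = policy;
    }
}

// Hypothetical NM-side helper.
final class NmPolicyResolver {
    // Assumed rule: a stop-time policy, when present, overrides the launch-time one.
    static ContainerLogAggregationPolicy resolve(
            ContainerLogAggregationPolicy launchPolicy,
            ContainerStopContext stopContext) {
        if (stopContext != null && stopContext.getLogAggregationPolicy() != null) {
            return stopContext.getLogAggregationPolicy();
        }
        return launchPolicy;
    }
}
```

Wrapping the policy in a context object means later stop-time fields can be added without changing the StopContainerRequest signature again.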
To support #3, we need a new API, such as updateContainer, so that the AM can ask the NM to roll the container log and update the log aggregation policy while the container keeps running.
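To give a feel for what such a call might carry, here is a small sketch with an in-memory stand-in for the NM. Nothing here exists in YARN: UpdateContainerRequest, its fields, and FakeNodeManager are hypothetical, and updateContainer is only the name suggested above.

```java
import java.util.HashMap;
import java.util.Map;

// Assumed policy values, for illustration only.
enum ContainerLogAggregationPolicy { AGGREGATE, DO_NOT_AGGREGATE }

// Hypothetical request; updateContainer is only a suggested name.
final class UpdateContainerRequest {
    final String containerId;
    final boolean rollLog;                       // ask the NM to roll the current log
    final ContainerLogAggregationPolicy policy;  // null means "leave unchanged"

    UpdateContainerRequest(String containerId, boolean rollLog,
                           ContainerLogAggregationPolicy policy) {
        this.containerId = containerId;
        this.rollLog = rollLog;
        this.policy = policy;
    }
}

// In-memory stand-in for the NM side, to show the intended effect only.
final class FakeNodeManager {
    private final Map<String, ContainerLogAggregationPolicy> policies = new HashMap<>();
    private final Map<String, Integer> rollCounts = new HashMap<>();

    // The container keeps running; only its log state and policy change.
    void updateContainer(UpdateContainerRequest request) {
        if (request.rollLog) {
            rollCounts.merge(request.containerId, 1, Integer::sum);
        }
        if (request.policy != null) {
            policies.put(request.containerId, request.policy);
        }
    }

    int rolledLogCount(String containerId) {
        return rollCounts.getOrDefault(containerId, 0);
    }

    ContainerLogAggregationPolicy policyOf(String containerId) {
        return policies.get(containerId);
    }
}
```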