It's a little unfortunate that YARN-3946 started putting non-fatal messages into what is typically an app-driven diagnostic repository. Now all applications will start getting these (probably mostly annoying) messages for every job completion, assuming that most app frameworks dump the diagnostic strings when the application completes. It seems these new messages only make sense to report while the job is active and are mostly noise afterwards.
Back to the MapReduce side of this, IMHO we need to return diagnostics for any case where we used to return diagnostics before. Since this is specific to MapReduce, we can check the MR AM to see all the places where we could set a diagnostic. Most places I found only set the diagnostic when the job fails, but I did find at least one place where the diagnostic could be set yet the job could succeed. When a task fails, a job diagnostic is added; see JobImpl.TaskCompletedTransition#taskFailed. If the user configured the job to tolerate some task failures and still succeed, then we could end up with a successful job that has some task failure messages in its diagnostics.
However, that's a relatively rare config for a typical MapReduce job, and I'm not sure how many downstream software stacks are going to start getting upset when they see getFailureInfo start returning data on a regular basis for successful jobs. It's rather unfortunate that the method is called getFailureInfo and will now always contain messages unrelated to any failure. The downstream stacks should be checking the overall job status, not whether the getFailureInfo result is empty, to know whether the job really did fail, so on one hand I'm leaning towards reporting them on success as well. But then part of me thinks it will simply be annoying to have every successful job dump a bunch of messages about waiting to schedule, waiting to register, etc., which leads me to wonder if we really want YARN-3946 to work the way it does.
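
To make the downstream concern concrete, here is a minimal sketch of the two ways a client might decide whether a job failed. The `State` and `Status` types are hypothetical stand-ins I made up for illustration, not the real Hadoop classes, and the diagnostic string is invented.

```java
public class JobOutcomeCheck {
    // Hypothetical stand-in for a job's terminal state (not the Hadoop enum).
    enum State { SUCCEEDED, FAILED, KILLED }

    // Hypothetical stand-in for a job status report: a state plus
    // whatever diagnostic text was attached to the job.
    static final class Status {
        final State state;
        final String failureInfo;

        Status(State state, String failureInfo) {
            this.state = state;
            this.failureInfo = failureInfo;
        }
    }

    // Fragile check: treats any non-empty diagnostic text as a failure.
    static boolean failedByDiagnostics(Status s) {
        return s.failureInfo != null && !s.failureInfo.isEmpty();
    }

    // Robust check: looks only at the overall job state.
    static boolean failedByState(Status s) {
        return s.state == State.FAILED;
    }

    public static void main(String[] args) {
        // A successful job that still carries diagnostic text, e.g. from a
        // tolerated task failure or a non-fatal scheduling message.
        Status s = new Status(State.SUCCEEDED, "Task failed (illustrative message)");

        System.out.println("failedByDiagnostics: " + failedByDiagnostics(s));
        System.out.println("failedByState: " + failedByState(s));
    }
}
```

A stack written like `failedByDiagnostics` would misreport this job as failed, while one written like `failedByState` would not, which is the argument for reporting the diagnostics on success anyway.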