Thanks for taking a look at the patch, Sidd.
Handling FAILED / KILLED tasks from previous runs (Some jobs allow a percentage of tasks to fail)
Agreed. In the short term, to mitigate some additional risk, the patch only tries to recover the same set of tasks as before. I'd prefer to handle this in a separate JIRA, but it should be easy to do here as well. In addition to FAILED/KILLED tasks, we could also recover information for tasks that were RUNNING, marking their in-flight attempts as KILLED; we'd at least have their start times, the nodes they ran on, and pointers to their logs.
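To illustrate, here's a minimal sketch of that idea, not the actual MRAppMaster recovery code; the `AttemptInfo` record and `recover` method are hypothetical stand-ins for whatever is parsed from the previous run's job history:

```java
import java.util.ArrayList;
import java.util.List;

public class RecoverySketch {
  enum AttemptState { RUNNING, SUCCEEDED, FAILED, KILLED }

  // Hypothetical stand-in for an attempt record parsed from the
  // previous run's history file.
  static class AttemptInfo {
    final String id;
    AttemptState state;
    final long startTime;
    final String node;
    final String logUrl;

    AttemptInfo(String id, AttemptState state, long startTime,
                String node, String logUrl) {
      this.id = id;
      this.state = state;
      this.startTime = startTime;
      this.node = node;
      this.logUrl = logUrl;
    }
  }

  // Attempts that were in flight when the previous AM died cannot be
  // resumed, so mark them KILLED while keeping their start time, node,
  // and log pointer for the UI/history.
  static List<AttemptInfo> recover(List<AttemptInfo> previous) {
    List<AttemptInfo> recovered = new ArrayList<>();
    for (AttemptInfo a : previous) {
      if (a.state == AttemptState.RUNNING) {
        a.state = AttemptState.KILLED;
      }
      recovered.add(a);
    }
    return recovered;
  }

  public static void main(String[] args) {
    List<AttemptInfo> prev = new ArrayList<>();
    prev.add(new AttemptInfo("attempt_1_m_000000_0", AttemptState.RUNNING,
                             1000L, "node1", "http://node1/logs"));
    List<AttemptInfo> rec = recover(prev);
    System.out.println(rec.get(0).state + " " + rec.get(0).node);
  }
}
```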
For Successful tasks, if recovery fails for the successful attempt (committer failure, etc.) - should this be considered a failure, and count towards the max-attempt limit?
I debated this a bit when I wrote it and decided to give the task the benefit of the doubt and let it try again rather than fail it. Recovery isn't a "normal" part of the task flow, so I thought it better to give the task a fresh attempt rather than consume one of its failed attempts if recovery encounters an error. I don't have strong feelings on it, though. If the consensus is that it should count as an attempt failure, then it's a straightforward change to mark it as such.
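The two policies only differ in whether the recovery failure increments the task's failed-attempt count before it is compared against the limit. A toy sketch, with hypothetical names (`shouldFailTask`, `countRecoveryAsFailure`), not the AM's actual bookkeeping:

```java
public class RecoveryPolicySketch {
  // failedAttempts/maxAttempts are hypothetical stand-ins for the AM's
  // per-task counters; countRecoveryAsFailure models the open question.
  static boolean shouldFailTask(int failedAttempts, int maxAttempts,
                                boolean recoveryFailed,
                                boolean countRecoveryAsFailure) {
    if (recoveryFailed && countRecoveryAsFailure) {
      failedAttempts++;  // charge the recovery error as a failed attempt
    }
    return failedAttempts >= maxAttempts;
  }

  public static void main(String[] args) {
    // Current patch: recovery failure is not charged, task gets a retry.
    System.out.println(shouldFailTask(3, 4, true, false));
    // Alternative: recovery failure consumes the last attempt.
    System.out.println(shouldFailTask(3, 4, true, true));
  }
}
```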
Speculator info from the previous run could be recovered as well.
Yes, we should be able to reconstruct many, if not all, of the speculator events as well. I'd prefer to defer that to a separate JIRA.