Thanks for taking a look, Bikas.
I hesitate at the idea of introducing more events, unless absolutely necessary. FAILED seems to be a good enough event to inform the task that an attempt has failed. FAIL_FETCH_FAILURE now means for any other kind of failure in the future I would have to introduce more events and introduce more arcs in the state machine. Large state machines are hard to understand.
I agree that we should strive to minimize the number of events, but the two types of failure are qualitatively different (failure of task attempts vs. failure due to fetch errors long after the task attempts have completed). Also, we already have TaskAttemptEventType.TA_TOO_MANY_FETCH_FAILURE, which this changes mirrors.
What surprises me is that the success of an attempt did not end up terminating the concurrent attempts. I would expect the speculative attempt to be Killed during the call to AttemptTransitionSucceeded(). Did that not work?
No, the speculative attempt was the one that succeeded, so it wasn't killed.
Or was there a close race condition that the speculative task failed at the same time as the succeeded attempt completed? So the KILLED event from the SUCCEEDED task raced against the FAILED event in the TaskAttemptImpl state machine, with the FAILED event winning?
The original attempt failed after the speculative attempt succeeded, and overwrote the SUCCEEDED state. I'm not sure this is a race condition, so much as an error in the allowed state machine transitions.
What do you think about the following approach? Allow MapRetroActiveFailureTransition to return SUCCEEDED as a possible state. In that transition, if the failed attempt is not the same as the attempt that SUCCEEDED, then that failure would not change the state of the Task. Task would remain in succeeded state. We can rename MapRetroActiveFailureTransition to something more appropriate.
That would work (I wrote a patch for it, but JIRA won't let me attach it). All things being equal, however, I'd rather add more events or transitions to the state machine, than special case the logic for an event.