Description
The OutputCommitCoordinator was originally introduced in SPARK-4879 because speculation causes the output of some partitions to be deleted. However, as we can see in SPARK-10063, speculation is not the only case where this can happen.
More specifically, when we retry a stage we're not guaranteed to kill the tasks that are still running (we don't even interrupt their threads), so we may end up with multiple concurrent task attempts for the same task. This leads to problems like SPARK-8029, and that fix is necessary but not sufficient on its own.
In general, in situations like these we need the OutputCommitCoordinator because we don't control what the underlying file system does. Enabling it doesn't incur heavy performance costs, so there's little reason not to always enable it to ensure correctness.
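To make the failure mode concrete, here is a minimal sketch of first-committer-wins arbitration between concurrent attempts of the same task. It is a toy simulation in plain Python, not Spark's actual OutputCommitCoordinator; the names `CommitCoordinator`, `can_commit`, `attempt_failed`, `run_attempt`, and `demo` are hypothetical and exist only for illustration.

```python
import threading


class CommitCoordinator:
    """Toy first-committer-wins arbiter (hypothetical, not Spark's class)."""

    def __init__(self):
        self._lock = threading.Lock()
        # (stage, partition) -> attempt id that is authorized to commit
        self._winner = {}

    def can_commit(self, stage, partition, attempt):
        """Authorize the first attempt that asks; deny every other attempt."""
        with self._lock:
            winner = self._winner.setdefault((stage, partition), attempt)
            return winner == attempt

    def attempt_failed(self, stage, partition, attempt):
        """Release the authorization if the authorized attempt failed."""
        with self._lock:
            if self._winner.get((stage, partition)) == attempt:
                del self._winner[(stage, partition)]


def run_attempt(coordinator, committed, stage, partition, attempt):
    # A real task would first write its output to a temporary location
    # (elided here) and only then ask for permission to commit it.
    if coordinator.can_commit(stage, partition, attempt):
        committed.append((stage, partition, attempt))


def demo():
    """Run two concurrent attempts for the same partition of the same stage."""
    coordinator = CommitCoordinator()
    committed = []
    threads = [
        threading.Thread(
            target=run_attempt, args=(coordinator, committed, 0, 7, attempt)
        )
        for attempt in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return committed
```

Calling `demo()` returns a list with exactly one committed attempt for stage 0, partition 7. Which attempt wins depends on thread scheduling, which is the point: the arbiter does not care why there are two attempts (speculation or a stage retry), only that at most one of them commits.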