Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Version: 2.3.0
Description
There's a bug in:
/** Check whether a task is currently running an attempt on a given host */
private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = {
  taskAttempts(taskIndex).exists(_.host == host)
}
Because the check matches any attempt, running or not, it also excludes hosts whose attempts have already finished. We should instead check whether an attempt is currently running on the given host.
It should be possible for a speculative task to run on a host where another attempt has failed earlier.
Assume we have only two machines, host1 and host2. We first run task0.0 on host1. Then, because task0.0 is taking a long time, we launch a speculative task0.1 on host2. task0.0 eventually fails on host1, but it cannot be re-run because there is already a copy running on host2. After another long wait, we launch a new speculative task0.2. At that point task0.2 should be able to run on host1 again, since host1 no longer has a running attempt.
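The scenario can be reproduced in a small standalone sketch. It is not Spark code: `Attempt` is a hypothetical, simplified stand-in for the scheduler's per-attempt record, `HostCheckSketch` and `hasRunningAttemptOnHost` are names invented for illustration, and the `running` flag is an assumption about how a finished attempt could be told apart from a live one.

```scala
// Hypothetical, simplified stand-in for the scheduler's per-attempt record.
final case class Attempt(host: String, running: Boolean)

object HostCheckSketch {
  // State just before task0.2 is launched:
  // task0.0 has failed on host1, task0.1 is still running on host2.
  val taskAttempts: Map[Int, List[Attempt]] = Map(
    0 -> List(
      Attempt("host2", running = true),  // task0.1
      Attempt("host1", running = false)  // task0.0 (failed)
    )
  )

  // Behavior quoted in this report: matches any attempt, finished or not.
  def hasAttemptOnHost(taskIndex: Int, host: String): Boolean =
    taskAttempts(taskIndex).exists(_.host == host)

  // Illustrative variant of the check this report asks for:
  // only attempts that are still running count.
  def hasRunningAttemptOnHost(taskIndex: Int, host: String): Boolean =
    taskAttempts(taskIndex).exists(a => a.running && a.host == host)
}
```

With this state, `hasAttemptOnHost(0, "host1")` is true, so host1 would be rejected for task0.2, while `hasRunningAttemptOnHost(0, "host1")` is false, so host1 would be eligible.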
******
After discussion in the PR, we simply made the comment consistent with the method's behavior. See PR#20998 for details.