As a user, I have found the ability to manually speculate tasks via the web UI incredibly useful--so useful that I'm starting to worry about RSI, given that each speculation takes a click to the task page, a click on the task, a click on speculate, and a click on the confirm dialog. The tasks I end up speculating are frequently lost-TaskTracker failures, and Hadoop currently just relies on a timeout to catch them.
But how am I beating the current system? By comparing each task's performance to the other tasks in the same job:
1) If there is only one task (either map or reduce), always speculate it. Maybe turn this off for clusters that have very few slots, but on a cluster with >1000 slots or so this is trivially cheap, and it keeps a single hung task from literally doubling the job's runtime.
2) Collect data on the other tasks in the same job. If 99% of mappers went from 0% complete to >0% complete within 5 seconds, and the remaining stragglers still haven't budged after 5 minutes, speculate them. Ditto for reducers. (Unbalanced data can cause false positives here.)
3) Collect data on stalls. If a task doesn't improve its % complete within some timeframe derived from the other tasks in the same job, speculate the "hung" task. (A rough sketch of 1)-3) follows this list.)
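To make 1)-3) concrete, here is a rough sketch in Java of what the per-job comparison might look like. The TaskView and Speculator interfaces, and every threshold in it (the 1000-slot cutoff, the 99% figure, the 5-minute wait, the 10x-the-median stall test), are made up for illustration; this is not the actual JobTracker speculation code.

import java.util.List;

public class SpeculationHeuristics {

  /** Minimal view of a task attempt's progress; a hypothetical stand-in, not a real Hadoop class. */
  public interface TaskView {
    float progress();            // 0.0f .. 1.0f
    long millisSinceProgress();  // time since progress() last improved
    boolean isSpeculated();
  }

  /** Hook that actually launches a speculative attempt; also hypothetical. */
  public interface Speculator {
    void speculate(TaskView task);
  }

  public static void maybeSpeculate(List<TaskView> tasks, Speculator speculator, int clusterSlots) {
    if (tasks.isEmpty()) {
      return;
    }

    // Heuristic 1: a job with a single map (or single reduce) gets speculated unconditionally
    // on a big cluster, since one hung task literally doubles the job's runtime.
    if (tasks.size() == 1 && clusterSlots > 1000) {
      speculator.speculate(tasks.get(0));
      return;
    }

    // Baseline for heuristic 3: the median time-since-last-progress across the job's tasks.
    long[] idle = tasks.stream().mapToLong(TaskView::millisSinceProgress).sorted().toArray();
    long medianIdle = idle[idle.length / 2];

    // Input for heuristic 2: how many tasks have moved off 0% at all.
    long started = tasks.stream().filter(t -> t.progress() > 0f).count();
    boolean mostStarted = started >= 0.99 * tasks.size();

    for (TaskView t : tasks) {
      if (t.isSpeculated()) {
        continue;
      }
      // Heuristic 2: nearly everyone left 0% quickly, but this task is still stuck there.
      boolean stuckAtZero = mostStarted && t.progress() == 0f
          && t.millisSinceProgress() > 5 * 60 * 1000L;
      // Heuristic 3: no progress for far longer than is normal for this job.
      boolean stalled = t.millisSinceProgress() > 10 * medianIdle;
      if (stuckAtZero || stalled) {
        speculator.speculate(t);
      }
    }
  }
}

Even this toy version has half a dozen thresholds to get wrong, and tuning them is exactly the part I expect to be hard.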
...in other words, I agree that there is probably an easy way to model the failed tasks, but only from a modeling perspective. Getting the heuristics and models right and implementing them is probably much, much more difficult than implementing "hadoop job -speculate-task task_identifier_here".
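For contrast, the manual command needs almost no logic at all; something like the sketch below would do, where the class name and the placeholder print (standing in for an RPC to the JobTracker) are hypothetical, not existing Hadoop code.

public class SpeculateTaskCommand {
  // Hypothetical handler for "hadoop job -speculate-task <task_attempt_id>";
  // a real patch would forward the attempt ID to the JobTracker over RPC.
  public static void main(String[] args) {
    if (args.length != 1) {
      System.err.println("Usage: hadoop job -speculate-task <task_attempt_id>");
      System.exit(1);
    }
    String attemptId = args[0];
    // Placeholder for the RPC call; printing stands in for the actual request.
    System.out.println("Requesting a speculative attempt for " + attemptId);
  }
}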
But also: implementing the latter is necessary to discover how and when the heuristics themselves are failing. Giving users the ability to speculate manually also gives admins the ability to see when users are doing it.