[MAPREDUCE-2039] Improve speculative execution - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

In speculation, the framework issues a second task attempt on a task where one attempt is already running. This is useful if the running attempt is bogged down for reasons outside of the task's code, so a second attempt finishes ahead of the existing attempt, even though the first attempt has a head start.

Early versions of speculation had the weakness that an attempt that starts out well but breaks down near the end would never get speculated. That got fixed in HADOOP-2141, but in the fix the speculation wouldn't engage until the performance of the old attempt, even counting the early portion where it progressed normally, was significantly worse than average.

I want to fix that by overweighting the more recent progress increments. In particular, I would like to use exponential smoothing with a lambda of approximately 1/minute [which is the time scale of speculative execution] to measure progress per unit time. This affects the speculation code in two places:

It affects the set of task attempts we consider to be underperforming.
It affects our estimates of when we expect tasks to finish. This could be hugely important; speculation's main benefit is that it gets a single outlier task finished earlier than otherwise possible, and we need to know which task is the outlier as accurately as possible.
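
To make the idea concrete, here is a minimal Python sketch (not Hadoop code) of an exponentially smoothed progress rate and the finish-time estimate derived from it. The class and method names are hypothetical, and the time-decayed weight `exp(-lambda * dt)` is one assumed way to apply a lambda of about 1/minute to progress reports that arrive at irregular intervals.

```python
import math


class SmoothedProgressRate:
    """Hypothetical helper: exponentially smoothed progress per second."""

    def __init__(self, lambda_per_sec=1.0 / 60.0):
        # lambda of roughly 1/minute, expressed per second (assumed unit)
        self.lambda_per_sec = lambda_per_sec
        self.last_time = None
        self.last_progress = 0.0
        self.rate = None  # smoothed progress per second, None until known

    def update(self, now, progress):
        """Record a progress report (progress in [0, 1]) taken at time `now`."""
        if self.last_time is None:
            self.last_time, self.last_progress = now, progress
            return self.rate
        dt = now - self.last_time
        if dt <= 0:
            return self.rate  # ignore out-of-order or duplicate reports
        instant = (progress - self.last_progress) / dt
        if self.rate is None:
            self.rate = instant
        else:
            # Older history decays by exp(-lambda * dt); the newest
            # increment receives the remaining weight.
            keep = math.exp(-self.lambda_per_sec * dt)
            self.rate = keep * self.rate + (1.0 - keep) * instant
        self.last_time, self.last_progress = now, progress
        return self.rate

    def estimated_seconds_remaining(self):
        """Remaining progress divided by the smoothed rate."""
        if self.rate is None or self.rate <= 0:
            return math.inf
        return (1.0 - self.last_progress) / self.rate
```

For an attempt that runs normally and then bogs down, this smoothed rate falls below the whole-run average within a few time constants, so both the underperformance check and the finish-time estimate react sooner than one based on total progress over total elapsed time.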

I would like a rich suite of configuration variables, minimally including lambda and possibly weighting factors. We might have two exponentially smoothed tracking variables of the progress rate, to distinguish attempts that are bogged down and getting worse from those that are bogged down but improving.
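
One way to picture the two tracking variables is a fast tracker and a slow tracker fed the same progress samples. The Python sketch below is only an illustration: the configuration names, the 5-minute slow time constant, and the 5% dead band are all assumptions, not existing Hadoop settings.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SmoothingConfig:
    # Hypothetical knobs; none of these names come from Hadoop.
    fast_lambda_per_sec: float = 1.0 / 60.0   # about 1/minute
    slow_lambda_per_sec: float = 1.0 / 300.0  # assumed longer memory
    tolerance: float = 0.05                   # relative dead band


def smooth(samples, lambda_per_sec):
    """Smoothed rate over (time_sec, progress) samples with increasing times."""
    rate = None
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        dt = t1 - t0
        instant = (p1 - p0) / dt
        if rate is None:
            rate = instant
        else:
            keep = math.exp(-lambda_per_sec * dt)
            rate = keep * rate + (1.0 - keep) * instant
    return rate


def trend(samples, cfg=SmoothingConfig()):
    """Compare the fast tracker with the slow one to classify the attempt."""
    fast = smooth(samples, cfg.fast_lambda_per_sec)
    slow = smooth(samples, cfg.slow_lambda_per_sec)
    if fast is None or slow is None:
        return "unknown"  # fewer than two samples
    if fast < slow * (1.0 - cfg.tolerance):
        return "worsening"  # recent rate is below the longer-term rate
    if fast > slow * (1.0 + cfg.tolerance):
        return "improving"  # recent rate is above the longer-term rate
    return "steady"
```

The fast tracker follows recent increments closely while the slow one remembers more history, so their ordering indicates the direction in which the rate is moving.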

Perhaps we should be especially eager to speculate a second attempt. If a task is deterministically failing after bogging down [think "rare infinite loop bug"], we would rather run a couple of attempts in parallel to discover the problem sooner.

As part of this patch we would like to add benchmarks that simulate rare tasks that behave poorly, so we can discover whether this change in the code is a good idea and what the proper configuration is. Early versions of this will be driven by our assumptions. Later versions will be driven by the fruits of MAPREDUCE-2037.
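
A toy version of such a benchmark, driven purely by assumptions, might look like the Python sketch below. Everything in it is hypothetical: the healthy rate, the slowdown factor, the 45% cutoff, and the function names are chosen for illustration and are not taken from Hadoop or from MAPREDUCE-2037 data.

```python
import math
import random

NORMAL_RATE = 0.01  # progress per second for a healthy attempt (assumed)


def simulate_attempt(bog_down_at_sec=None, slow_factor=0.1, step_sec=10.0):
    """Return (time_sec, progress) samples; the rate drops after bog_down_at_sec."""
    t, p, samples = 0.0, 0.0, [(0.0, 0.0)]
    while p < 1.0 - 1e-9:
        bogged = bog_down_at_sec is not None and t >= bog_down_at_sec
        rate = NORMAL_RATE * (slow_factor if bogged else 1.0)
        t += step_sec
        p = min(1.0, p + rate * step_sec)
        samples.append((t, p))
    return samples


def average_rate(samples):
    """Whole-run average: total progress over total elapsed time."""
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    return (p1 - p0) / (t1 - t0)


def smoothed_rate(samples, lambda_per_sec=1.0 / 60.0):
    """Exponentially smoothed rate that overweights recent increments."""
    rate = None
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        instant = (p1 - p0) / (t1 - t0)
        keep = math.exp(-lambda_per_sec * (t1 - t0))
        rate = instant if rate is None else keep * rate + (1.0 - keep) * instant
    return rate


def first_flag_time(samples, rate_fn, cutoff_fraction=0.45):
    """Earliest sample time at which the measured rate falls below the cutoff."""
    for i in range(2, len(samples) + 1):
        if rate_fn(samples[:i]) < cutoff_fraction * NORMAL_RATE:
            return samples[i - 1][0]
    return None  # never flagged before the attempt finished


def run_benchmark(num_attempts=1000, bad_fraction=0.02, seed=0):
    """Mean delay between bog-down and flagging, per rate estimator."""
    rng = random.Random(seed)
    delays = {"average": [], "smoothed": []}
    for _ in range(num_attempts):
        if rng.random() >= bad_fraction:
            continue  # healthy attempt; this sketch only measures detection delay
        bog_at = 10.0 * rng.randint(1, 9)
        samples = simulate_attempt(bog_at)
        for name, fn in (("average", average_rate), ("smoothed", smoothed_rate)):
            flagged = first_flag_time(samples, fn)
            if flagged is not None:
                delays[name].append(flagged - bog_at)
    return {k: (sum(v) / len(v) if v else None) for k, v in delays.items()}
```

In this toy model the smoothed rate flags a late bog-down earlier than the whole-run average, while an attempt that bogs down very early can be flagged sooner by the plain average. Trade-offs like this are what the benchmark would be meant to surface when choosing lambda and the cutoff.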

Issue Links

is related to

MAPREDUCE-2037 Capturing interim progress times, CPU usage, and memory usage, when tasks reach certain progress thresholds (Closed)
HADOOP-2141 speculative execution start up condition based on completion time (Closed)

relates to

MAPREDUCE-2063 We need a benchmark to model system behavior in the face of tasks with time-variant performance (Open)

People

Assignee: Dick King

Reporter: Dick King

Votes: 0

Watchers: 17

Dates

Created: 27/Aug/10 23:41

Updated: 30/Jul/14 23:38

Resolved: 30/Jul/14 23:38