[TEZ-965] Tez needs a "circuit-breaker" to avoid mistaking network blips for task/node failures - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None
Environment: Flaky DNS cluster
Target Version/s: 0.8.6

Description

If DNS resolution fails for a period of 5-10 seconds, Tez restarts tasks and works backwards through the query, triggering recovery of nearly everything it has already run.

Nodes are getting marked as bad because they can't shuffle (DNS resolution failed for all NMs), which results in log lines like:

attempt_1394928384313_0234_1_25_000654_0 blamed for read error from attempt_1394928384313_0234_1_24_000366_0

The tasks then restart from an earlier vertex.

When a large number of such failures happen, the tasks shouldn't restart previous vertices; instead, they should trip a circuit breaker and back off until the network blip passes.
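
A minimal sketch of what such a circuit breaker might look like, assuming a sliding-window failure count. Nothing here is existing Tez code: the class name `FetchFailureCircuitBreaker`, its methods, and the threshold, window, and back-off parameters are all hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch, not part of Tez: trips when too many fetch failures
 * are seen inside a sliding time window, then stays open for a back-off period.
 */
final class FetchFailureCircuitBreaker {
    private final int failureThreshold;   // failures within the window needed to trip
    private final long windowMillis;      // length of the sliding window
    private final long backOffMillis;     // how long the breaker stays open once tripped
    private final Deque<Long> recentFailures = new ArrayDeque<>();
    private long openUntilMillis = Long.MIN_VALUE;

    FetchFailureCircuitBreaker(int failureThreshold, long windowMillis, long backOffMillis) {
        if (failureThreshold < 1 || windowMillis <= 0 || backOffMillis <= 0) {
            throw new IllegalArgumentException("threshold, window and back-off must be positive");
        }
        this.failureThreshold = failureThreshold;
        this.windowMillis = windowMillis;
        this.backOffMillis = backOffMillis;
    }

    /** Records one fetch failure; trips the breaker once the window holds enough failures. */
    synchronized void recordFailure(long nowMillis) {
        // Drop failures that have aged out of the sliding window.
        while (!recentFailures.isEmpty() && recentFailures.peekFirst() <= nowMillis - windowMillis) {
            recentFailures.removeFirst();
        }
        recentFailures.addLast(nowMillis);
        if (recentFailures.size() >= failureThreshold) {
            openUntilMillis = nowMillis + backOffMillis;
        }
    }

    /** A successful fetch suggests the blip has passed, so reset the breaker. */
    synchronized void recordSuccess() {
        recentFailures.clear();
        openUntilMillis = Long.MIN_VALUE;
    }

    /** While open, callers would back off and retry instead of blaming the upstream attempt. */
    synchronized boolean isOpen(long nowMillis) {
        return nowMillis < openUntilMillis;
    }
}
```

Under these assumptions, a caller would check `isOpen(now)` before reporting a read error against an upstream attempt: while the breaker is open, it would wait and retry the fetch rather than trigger a re-run of the earlier vertex. Where such a check would live in Tez, and what thresholds are sensible, is left open by this report.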

People

Assignee: Unassigned

Reporter: Gopal Vijayaraghavan

Votes: 0

Watchers: 2

Dates

Created: 20/Mar/14 23:35

Updated: 14/Mar/17 03:40