Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
-
None
-
Flaky DNS cluster
Description
If DNS resolution fails for a period of 5-10 seconds, Tez restarts & contra-flows in the query triggering recovery of nearly everything it has run.
Nodes are getting marked as bad because they can't shuffle (dns resolution failed for all NMs), which results in log lines like
attempt_1394928384313_0234_1_25_000654_0 blamed for read error from attempt_1394928384313_0234_1_24_000366_0
And the tasks restart from an earlier vertex.
When a large number of such failures happen, the tasks shouldn't restart previous vertexes, but instead should flip a circuit & back-off till the network blip disappears.