Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
The current logic for handling bad nodes is derived from MapReduce (MR) and does not work in all cases.
Things to consider
1) Have a notion of probation, where machines are taken out of service for a period of time (say 5m, 15m and 30m) before being given up on for good. This allows more graceful handling of temporary glitches.
2) Handle a node that YARN marks as bad differently from a node flagged by internal heuristics.
3) Bad nodes should not immediately trigger re-execution of completed work. Re-execution should be based on the presence of downstream consumers (i.e. existing demand for that output) and a reasonable indication from other consumers that the node cannot serve results (e.g. multiple reports of read errors with that node as the source).
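
To make the three points above concrete, here is a minimal sketch of how they could fit together. It is not the existing implementation and does not use any real YARN or application master API: the names NodeTracker, NodeState, PROBATION_PERIODS and read_error_threshold are all hypothetical, the probation periods are the example values from point 1, and treating a YARN report as final (point 2) is just one assumed policy.

```python
from dataclasses import dataclass, field

# Hypothetical probation periods in seconds (5m, 15m, 30m), taken from the
# example values in point 1.
PROBATION_PERIODS = (5 * 60, 15 * 60, 30 * 60)


@dataclass
class NodeState:
    strikes: int = 0        # number of probation periods served so far
    out_until: float = 0.0  # time at which the current probation ends
    given_up: bool = False  # True once the node is abandoned for good
    # Distinct consumers that reported read errors with this node as source.
    read_error_reporters: set = field(default_factory=set)


class NodeTracker:
    """Hypothetical tracker; not part of any real YARN or AM API."""

    def __init__(self, read_error_threshold=2):
        self.nodes = {}
        # Assumed threshold: how many distinct consumers must complain.
        self.read_error_threshold = read_error_threshold

    def _state(self, node):
        return self.nodes.setdefault(node, NodeState())

    def mark_bad_by_heuristic(self, node, now):
        """Point 1: escalate through the probation periods, then give up."""
        st = self._state(node)
        if st.given_up:
            return
        if st.strikes >= len(PROBATION_PERIODS):
            st.given_up = True
            return
        st.out_until = now + PROBATION_PERIODS[st.strikes]
        st.strikes += 1

    def mark_bad_by_yarn(self, node):
        """Point 2 (assumed policy): trust the report and skip probation."""
        self._state(node).given_up = True

    def is_usable(self, node, now):
        st = self._state(node)
        return not st.given_up and now >= st.out_until

    def report_read_error(self, node, consumer):
        """Record that a consumer failed to read output served by the node."""
        self._state(node).read_error_reporters.add(consumer)

    def should_rerun(self, node, has_downstream_consumers):
        """Point 3: re-run completed work only if its output is still in
        demand and enough distinct consumers failed to read from the node."""
        st = self._state(node)
        return (has_downstream_consumers
                and len(st.read_error_reporters) >= self.read_error_threshold)
```

In this sketch a node flagged by internal heuristics comes back into service after each probation period and is only abandoned on the fourth strike, while repeated read-error reports from the same consumer count once. Whether a YARN-reported bad node should ever be readmitted is left open here.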