[TEZ-3075] Revamp bad node handling - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

The current bad node handling logic is derived from MapReduce (MR) and does not work in all cases.
Things to consider:
1) Have a notion of probation where machines are put out of service for a period of time (say 5m, 15m and 30m) before being given up for good. This allows more graceful handling of temporary glitches.
2) Different handling for YARN marking a node as bad vs internal heuristics.
3) Bad nodes should not immediately trigger re-execution of completed work. Re-execution should instead be based on the presence of downstream consumers (i.e. existing demand for that output) and a reasonable indication from other consumers that the node cannot serve results (e.g. multiple reports of read errors with that node as a source).
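
A minimal sketch of points 1 and 3, written in Python for brevity and not taken from the Tez code base. It assumes an escalating probation schedule built from the "5m, 15m and 30m" suggestion and a threshold of two distinct consumers reporting read errors. Every name here (`NodeHealth`, `report_bad`, `should_reexecute`, `min_reporters`) is hypothetical, and the sketch covers only internal heuristics, not nodes that YARN itself marks as bad.

```python
from dataclasses import dataclass

# Assumed probation schedule in seconds: 5m, 15m, then 30m.
PROBATION_SECONDS = (5 * 60, 15 * 60, 30 * 60)


@dataclass
class NodeHealth:
    """Probation state for one node (illustrative, not a Tez class)."""
    strikes: int = 0        # probation periods served so far
    out_until: float = 0.0  # time at which the current probation ends
    retired: bool = False   # True once the node is given up for good

    def report_bad(self, now: float) -> None:
        """Record a failure signal observed at time `now` (seconds)."""
        if self.retired or now < self.out_until:
            return  # already out of service, nothing new to decide
        if self.strikes >= len(PROBATION_SECONDS):
            self.retired = True  # schedule exhausted: give up for good
            return
        self.out_until = now + PROBATION_SECONDS[self.strikes]
        self.strikes += 1

    def usable(self, now: float) -> bool:
        """A node is usable if it is not retired and not on probation."""
        return not self.retired and now >= self.out_until


def should_reexecute(pending_consumers: int,
                     read_error_reporters: set,
                     min_reporters: int = 2) -> bool:
    """Re-run completed work only if the output is still in demand and
    several distinct consumers failed to read it from the node."""
    return pending_consumers > 0 and len(read_error_reporters) >= min_reporters
```

With this shape, a node that fails once is only sidelined for five minutes, and completed output on it is left alone unless both conditions in `should_reexecute` hold.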

People

Assignee: Ying Han

Reporter: Bikas Saha

Votes: 0

Watchers: 6

Dates

Created: 26/Jan/16 18:04

Updated: 16/Oct/18 14:04