[TEZ-4233] Map task should be blamed earlier for local fetch failures - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.10.1
Component/s: None
Labels: None

Description

Fetch failures can result from either a network issue or a disk issue. Currently, the AM doesn't know whether the original input read error came from a local fetch failure. If a map output is reported as the subject of a local fetch failure, the AM should respond earlier and blame the map task as soon as possible. The underlying assumption is that a disk read should never fail (or should fail only rarely compared to network reads).

I detected this issue in a Kubernetes-based LLAP environment, where a daemon disappeared completely and a new daemon, running reducer tasks, assumed that it had the map outputs locally, which wasn't the case.
This patch can help in container mode as well: we can assume that a local read should work, and if it doesn't, the original map output data should be regenerated as soon as possible.
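
The intended AM-side decision could be sketched roughly as follows. This is only an illustration of the idea described above, not the code in the attached patches: the class `FetchFailureBlamePolicy`, the method `shouldBlame`, the `localFetch` flag and the `maxRemoteFailures` threshold are all hypothetical names, and the real AM logic takes more factors into account.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names are taken from the Tez code base.
final class FetchFailureBlamePolicy {
    // Assumed threshold: how many remote (network) fetch failures are
    // tolerated before the producing map attempt is blamed.
    private final int maxRemoteFailures;

    // Remote fetch failures reported so far, keyed by map attempt id.
    private final Map<String, Integer> remoteFailures = new HashMap<>();

    FetchFailureBlamePolicy(int maxRemoteFailures) {
        this.maxRemoteFailures = maxRemoteFailures;
    }

    /**
     * Returns true if the map attempt should be failed so that its
     * output is regenerated.
     */
    boolean shouldBlame(String mapAttemptId, boolean localFetch) {
        if (localFetch) {
            // A local disk read is assumed to (almost) never fail, so a
            // single report is enough to blame the map attempt.
            return true;
        }
        // Network problems may be transient: count reports and blame the
        // attempt only once the threshold is reached.
        int count = remoteFailures.merge(mapAttemptId, 1, Integer::sum);
        return count >= maxRemoteFailures;
    }
}
```

With a threshold of 3, a single report with `localFetch = true` is enough to blame the map attempt, while remote failures for the same attempt are tolerated until the third report.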

Attachments

TEZ-4233.01.patch, 12/Sep/20 14:22, 22 kB, László Bodor
TEZ-4233.02.patch, 22/Sep/20 09:56, 69 kB, László Bodor
TEZ-4233.03.patch, 22/Sep/20 12:30, 70 kB, László Bodor
TEZ-4233.04.patch, 23/Sep/20 09:53, 95 kB, László Bodor
TEZ-4233.05.patch, 23/Sep/20 10:59, 95 kB, László Bodor

Issue Links

is duplicated by: TEZ-4400 Tez takes a long time to recover from shuffle data not found errors (Resolved)

People

Assignee: László Bodor

Reporter: László Bodor

Votes: 0

Watchers: 5

Dates

Created: 12/Sep/20 14:15

Updated: 29/Oct/23 18:19

Resolved: 28/Sep/20 06:38