Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
Fetch failures can be caused by a network issue or a disk issue. Currently, the AM doesn't know whether the original input read error came from a local fetch failure or not. I think that if a map output is reported as the subject of a local fetch failure, the AM should react sooner and blame the source task attempt as soon as possible. The hidden assumption here is that a local disk read should never fail (or should fail only rarely compared to network fetches).
I detected this issue in a Kubernetes-based LLAP environment, where a daemon disappeared completely and a new daemon, running reducer tasks, assumed that it had the map outputs locally, which wasn't the case.
This patch can help in container mode as well: we can assume that a local read should work, and if it doesn't, the original map output data should be regenerated as soon as possible.
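The idea above can be illustrated with a minimal, self-contained sketch. It is not Tez code: the class `FetchFailurePolicy`, the method `shouldRerunSource`, and the remote failure threshold are hypothetical names and values, chosen only to show the intended decision. A local fetch failure blames the source attempt immediately, while remote failures are still counted before acting.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the proposed AM-side decision (not the Tez API).
 * A local fetch failure is treated as strong evidence that the map output
 * is gone, so the source attempt is blamed at once. Remote failures may be
 * transient network problems, so they are counted against a threshold.
 */
public class FetchFailurePolicy {
    // Assumed, illustrative threshold for remote (network) fetch failures.
    private final int remoteFailureThreshold;
    // Number of remote fetch failures reported per source attempt.
    private final Map<String, Integer> remoteFailures = new HashMap<>();

    public FetchFailurePolicy(int remoteFailureThreshold) {
        this.remoteFailureThreshold = remoteFailureThreshold;
    }

    /**
     * Records one reported fetch failure and decides whether the source
     * attempt should be failed so that its output is regenerated.
     *
     * @param sourceAttemptId identifier of the attempt that produced the output
     * @param isLocalFetch    true if the reader tried to read the output from local disk
     * @return true if the source attempt should be rerun now
     */
    public boolean shouldRerunSource(String sourceAttemptId, boolean isLocalFetch) {
        if (isLocalFetch) {
            // Local reads are assumed to be reliable: blame the source immediately.
            return true;
        }
        int count = remoteFailures.merge(sourceAttemptId, 1, Integer::sum);
        return count >= remoteFailureThreshold;
    }
}
```

With a threshold of 3, a single local fetch failure for an attempt triggers a rerun right away, whereas a remote failure for the same attempt only does so on the third report.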
Attachments
Issue Links
- is duplicated by: TEZ-4400 Tez takes a long time to recover from shuffle data not found errors (Resolved)