Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Duplicate
-
None
-
None
-
None
-
None
Description
Shuffle receives batches of Events to process from the AM. The way these events are sent over to the ShuffleHandlers and the way they're processed - it's possible that Shuffle will start fetching data from an Event, which is to be subsequently marked as failed (via an InputFailedEvent)
1) The AM sends events in batches. An InputFailedEvent for a specific Input may not be part of the same batch which contained the original event which is being marked bad.
2) The ShuffleEventHandler processes the events in each batch one event at a time - so even if the InputFailedEvent follows - it's possible for Shuffle to start fetching data from a Failed Input.
The AM needs to change to invalidate Inputs up front - so that related events don't span batches. Alternately, it needs to apply the InputFailedEvent to the original event being sent.
The Shuffle itself should process a batch update as a batch - that would prevent fetchers from starting early even though there may be additional events for the same host.
Attachments
Issue Links
- is duplicated by
-
TEZ-2599 Dont send obsoleted data movement events to tasks
- Closed