Details
- Bug
- Status: Patch Available
- Critical
- Resolution: Unresolved
- 3.2.1
- None
- None
Description
During a rolling restart of NodeManagers, some MapReduce jobs exit because of an unhandled TA_TOO_MANY_FETCH_FAILURE event.
Details:
If a task attempt is in the SUCCEEDED state when it receives a TA_TOO_MANY_FETCH_FAILURE event, the AM handles the situation correctly. If the attempt is in SUCCESS_FINISHING_CONTAINER or certain other states, the AM exits with an invalid event error. See YARN-1469, MAPREDUCE-7240, and MAPREDUCE-7249.
Reason:
When a map task sends the done RPC to the AM, the AM transitions the task attempt to the SUCCESS_FINISHING_CONTAINER state and adds it to the mapAttemptCompletionEvents list. When a reducer sends the getMapAttemptCompletionEvents RPC to fetch the completed maps, attempts that are still in the SUCCESS_FINISHING_CONTAINER state are returned as well. If the NM is restarted or stopped at this point, many reduce tasks fail to shuffle and report the failures to the AM, which then sends a TA_TOO_MANY_FETCH_FAILURE event to the map task attempt. If the attempt's current state cannot handle TA_TOO_MANY_FETCH_FAILURE, the AM exits.
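The failure mode described above can be sketched with a tiny stand-in state machine. This is only an illustration, not the actual TaskAttemptImpl code: the class InvalidEventSketch, the HANDLED table, and the dispatch method are hypothetical names, and IllegalStateException stands in for whatever exception the real state machine raises. It assumes, as this report states, that SUCCEEDED has a handler for the event and SUCCESS_FINISHING_CONTAINER does not.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class InvalidEventSketch {
    // Only the two states needed for the illustration.
    public enum State { SUCCEEDED, SUCCESS_FINISHING_CONTAINER }
    public enum Event { TA_TOO_MANY_FETCH_FAILURE }

    // Hypothetical table: which events each state has a registered handler for.
    static final Map<State, Set<Event>> HANDLED = new EnumMap<>(State.class);
    static {
        HANDLED.put(State.SUCCEEDED, EnumSet.of(Event.TA_TOO_MANY_FETCH_FAILURE));
        HANDLED.put(State.SUCCESS_FINISHING_CONTAINER, EnumSet.noneOf(Event.class));
    }

    // Returns true if the event is accepted; throws if the state has no handler.
    public static boolean dispatch(State current, Event event) {
        if (!HANDLED.get(current).contains(event)) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(dispatch(State.SUCCEEDED, Event.TA_TOO_MANY_FETCH_FAILURE));
        try {
            dispatch(State.SUCCESS_FINISHING_CONTAINER, Event.TA_TOO_MANY_FETCH_FAILURE);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the second dispatch call fails because the map attempt was already advertised to reducers while it was still in a state with no handler for the event, which is the window a NodeManager restart can hit.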
I found issues that address this problem, but they do not cover all situations.
The states reachable by transition from SUCCESS_FINISHING_CONTAINER can receive a TA_TOO_MANY_FETCH_FAILURE event, for example (SUCCEEDED, SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER, FAILED, KILL_CONTAINER_CLEANUP).
In Hadoop 3.2.1, only the SUCCEEDED, FAILED, and KILLED states can handle the TA_TOO_MANY_FETCH_FAILURE event. Some JIRAs fix SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER, and KILLED, but KILL_CONTAINER_CLEANUP and KILL_TASK_CLEANUP should also handle the TA_TOO_MANY_FETCH_FAILURE event.
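The coverage gap can be written out as a simple set difference. The sketch below only restates the state lists given in this description; the class FetchFailureCoverage and its fields are hypothetical names, and the sets are not read from the Hadoop source.

```java
import java.util.EnumSet;
import java.util.Set;

public class FetchFailureCoverage {
    // States named in this report that may receive TA_TOO_MANY_FETCH_FAILURE.
    public enum State {
        SUCCEEDED, SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER,
        FAILED, KILLED, KILL_CONTAINER_CLEANUP, KILL_TASK_CLEANUP
    }

    // Handled in Hadoop 3.2.1, according to this report.
    static final Set<State> HANDLED_IN_3_2_1 =
        EnumSet.of(State.SUCCEEDED, State.FAILED, State.KILLED);

    // Covered by the related JIRAs, according to this report.
    static final Set<State> FIXED_BY_RELATED_JIRAS = EnumSet.of(
        State.SUCCESS_CONTAINER_CLEANUP, State.SUCCESS_FINISHING_CONTAINER, State.KILLED);

    // States that still lack a handler once the related fixes are applied.
    public static Set<State> remaining() {
        Set<State> missing = EnumSet.allOf(State.class);
        missing.removeAll(HANDLED_IN_3_2_1);
        missing.removeAll(FIXED_BY_RELATED_JIRAS);
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(remaining()); // [KILL_CONTAINER_CLEANUP, KILL_TASK_CLEANUP]
    }
}
```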
Attachments
Issue Links
- is related to
  - MAPREDUCE-7240 Exception ' Invalid event: TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER' cause job error (Resolved)
  - MAPREDUCE-7249 Invalid event TA_TOO_MANY_FETCH_FAILURE at SUCCESS_CONTAINER_CLEANUP causes job failure (Resolved)
  - YARN-1469 ApplicationMaster crash cause the TaskAttemptImpl couldn't handle the TA_TOO_MANY_FETCH_FAILURE at KILLED (Resolved)