[MAPREDUCE-7264] overall reduction of ApplicationMaster exit because of unhandled TA_TOO_MANY_FETCH_FAILURE event - ASF JIRA

Details

Type: Bug
Status: Patch Available
Priority: Critical
Resolution: Unresolved
Affects Version/s: 3.2.1
Fix Version/s: None
Component/s: applicationmaster
Labels:
None

Description

During a rolling restart of NodeManagers, some MapReduce jobs fail because the ApplicationMaster (AM) exits on an unhandled TA_TOO_MANY_FETCH_FAILURE event.

Details:
If a task attempt is in the SUCCEEDED state when it receives a TA_TOO_MANY_FETCH_FAILURE event, the AM handles it correctly. If the attempt is in SUCCESS_FINISHING_CONTAINER or certain other states, the AM exits with an invalid-event error (see YARN-1469, MAPREDUCE-7240, MAPREDUCE-7249).
Reason:
When a map task sends the done RPC to the AM, the AM transitions the task attempt to the SUCCESS_FINISHING_CONTAINER state and adds it to the mapAttemptCompletionEvents list. When a reducer sends the getMapAttemptCompletionEvents RPC to fetch the completed maps, attempts still in SUCCESS_FINISHING_CONTAINER are returned as well. If an NM is restarted or stopped at this point, many reducers fail to shuffle and report the failures to the AM, which then sends a TA_TOO_MANY_FETCH_FAILURE event to the map attempt. If the attempt's current state cannot handle that event, the AM exits.
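The failure mode described above can be illustrated with a toy state machine. This is only a sketch, not Hadoop's TaskAttemptImpl: the class FetchFailureSketch, the ACCEPTED table, and the methods canHandle and dispatch are hypothetical names made up for the example, and the table lists only a few of the states named in this report.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class FetchFailureSketch {
    // A few of the task attempt states named in this report.
    enum State { SUCCESS_FINISHING_CONTAINER, SUCCEEDED, KILL_CONTAINER_CLEANUP, FAILED }
    enum Event { TA_TOO_MANY_FETCH_FAILURE }

    // Hypothetical transition table: which events each state accepts.
    static final Map<State, Set<Event>> ACCEPTED = new EnumMap<>(State.class);
    static {
        ACCEPTED.put(State.SUCCEEDED, EnumSet.of(Event.TA_TOO_MANY_FETCH_FAILURE));
        ACCEPTED.put(State.FAILED, EnumSet.of(Event.TA_TOO_MANY_FETCH_FAILURE));
    }

    static boolean canHandle(State state, Event event) {
        return ACCEPTED.getOrDefault(state, EnumSet.noneOf(Event.class)).contains(event);
    }

    // An event with no registered transition is rejected; in this sketch the
    // rejection stands in for the invalid-event error that makes the AM exit.
    static void dispatch(State state, Event event) {
        if (!canHandle(state, event)) {
            throw new IllegalStateException("Invalid event: " + event + " at " + state);
        }
    }

    public static void main(String[] args) {
        // The reducer learned about the map while it was still finishing its container.
        dispatch(State.SUCCEEDED, Event.TA_TOO_MANY_FETCH_FAILURE); // accepted
        try {
            dispatch(State.SUCCESS_FINISHING_CONTAINER, Event.TA_TOO_MANY_FETCH_FAILURE);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch, adding an entry for a state to the table is the analogue of registering a transition for TA_TOO_MANY_FETCH_FAILURE in that state.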

I found existing issues that address this problem, but they do not cover every case.

Any state reachable by transition from SUCCESS_FINISHING_CONTAINER may receive a TA_TOO_MANY_FETCH_FAILURE event, for example SUCCEEDED, SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER, FAILED, and KILL_CONTAINER_CLEANUP.

In Hadoop 3.2.1, only the SUCCEEDED, FAILED, and KILLED states handle the TA_TOO_MANY_FETCH_FAILURE event. Other JIRA issues fix SUCCESS_CONTAINER_CLEANUP, SUCCESS_FINISHING_CONTAINER, and KILLED, but KILL_CONTAINER_CLEANUP and KILL_TASK_CLEANUP should also handle this event.
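The remaining gap can be worked out as a set difference. The sketch below takes the state lists straight from this description and has not been checked against the Hadoop source; the class FetchFailureCoverage and the method uncovered are hypothetical names used only for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

public class FetchFailureCoverage {
    enum State {
        SUCCEEDED, SUCCESS_FINISHING_CONTAINER, SUCCESS_CONTAINER_CLEANUP,
        FAILED, KILLED, KILL_CONTAINER_CLEANUP, KILL_TASK_CLEANUP
    }

    // States that may receive the event but have no handler for it.
    static Set<State> uncovered(Set<State> mayReceive, Set<State> handled) {
        Set<State> rest = EnumSet.noneOf(State.class);
        rest.addAll(mayReceive);
        rest.removeAll(handled);
        return rest;
    }

    public static void main(String[] args) {
        // Per this report: every listed state may see TA_TOO_MANY_FETCH_FAILURE.
        Set<State> mayReceive = EnumSet.allOf(State.class);

        // Handled in 3.2.1 according to this report.
        Set<State> handled = EnumSet.of(State.SUCCEEDED, State.FAILED, State.KILLED);
        // Covered by the related issues MAPREDUCE-7240 and MAPREDUCE-7249.
        handled.add(State.SUCCESS_FINISHING_CONTAINER);
        handled.add(State.SUCCESS_CONTAINER_CLEANUP);

        System.out.println(uncovered(mayReceive, handled));
        // prints [KILL_CONTAINER_CLEANUP, KILL_TASK_CLEANUP]
    }
}
```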

Attachments

MAPREDUCE-7264-branch-3.2.001.patch
19/Feb/20 11:44
2 kB
tuyu

Issue Links

is related to

MAPREDUCE-7240 Exception ' Invalid event: TA_TOO_MANY_FETCH_FAILURE at SUCCESS_FINISHING_CONTAINER' cause job error

Resolved

MAPREDUCE-7249 Invalid event TA_TOO_MANY_FETCH_FAILURE at SUCCESS_CONTAINER_CLEANUP causes job failure

Resolved

YARN-1469 ApplicationMaster crash cause the TaskAttemptImpl couldn't handle the TA_TOO_MANY_FETCH_FAILURE at KILLED

Resolved

People

Assignee: Unassigned

Reporter: tuyu

Votes: 0

Watchers: 4

Dates

Created: 19/Feb/20 09:35

Updated: 09/Apr/20 17:26