[MAPREDUCE-3656] Sort job on 350 scale is consistently failing with latest MRV2 code - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.23.1
Fix Version/s: 0.23.1
Component/s: applicationmaster, mrv2, resourcemanager
Labels: None
Hadoop Flags: Reviewed
Release Note:
Fixed a race condition in the MR AM that was consistently failing the sort benchmark.

Description

Observed with code checked out over the last two days.
A Sort job on a 350-node cluster with 16800 maps and 680 reduces has failed consistently over roughly the last 6 runs.
When around 50% of the maps have completed, the job suddenly jumps to the failed state.
The NM log shows that the RM sent a Stop Container request to the NM for the AM container.
However, at INFO level the RM log does not show why the RM is killing the AM when the job was not killed manually.
One thing the failed AM logs have in common is:
org.apache.hadoop.yarn.state.InvalidStateTransitonException
with different events and states.
For example, one log says:

org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: TA_UPDATE at ASSIGNED

whereas another log says:

org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_COUNTER_UPDATE at ERROR
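
To illustrate how a message of the form "Invalid event: X at Y" can arise, here is a minimal sketch of a table-driven state machine that rejects any event with no registered transition from the current state. It is not the Hadoop implementation: the class `TransitionSketch`, its exception type, and the transition table are hypothetical, and the state and event names are borrowed from the logs above only for readability. If an event such as TA_UPDATE is delivered before the event that moves the attempt out of ASSIGNED, the lookup fails in the same way.

```java
import java.util.EnumMap;
import java.util.Map;

public class TransitionSketch {
    // Hypothetical, simplified states and events named after those in the logs.
    enum State { ASSIGNED, RUNNING, ERROR }
    enum Event { TA_CONTAINER_LAUNCHED, TA_UPDATE, JOB_COUNTER_UPDATE }

    /** Stand-in for the exception seen in the AM logs; not the Hadoop class. */
    static class InvalidTransitionException extends RuntimeException {
        InvalidTransitionException(Event event, State state) {
            super("Invalid event: " + event + " at " + state);
        }
    }

    // Transition table: current state -> (event -> next state).
    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.ASSIGNED;

    TransitionSketch() {
        // Assumed transitions for illustration only.
        addTransition(State.ASSIGNED, Event.TA_CONTAINER_LAUNCHED, State.RUNNING);
        addTransition(State.RUNNING, Event.TA_UPDATE, State.RUNNING);
    }

    private void addTransition(State from, Event on, State to) {
        table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    /** Applies one event, or throws if no transition is registered for it. */
    synchronized State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new InvalidTransitionException(event, current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        TransitionSketch attempt = new TransitionSketch();
        try {
            // Out-of-order delivery: the update arrives before the launch event.
            attempt.handle(Event.TA_UPDATE);
        } catch (InvalidTransitionException e) {
            System.out.println(e.getMessage());
        }
        // In the expected order, both events are accepted.
        attempt.handle(Event.TA_CONTAINER_LAUNCHED);
        System.out.println(attempt.handle(Event.TA_UPDATE));
    }
}
```

In this sketch the same event is rejected or accepted depending only on the order of delivery, which is why unsynchronized event ordering can surface as an invalid-transition error.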

Attachments

MR3656.txt, 13/Jan/12 19:16, 11 kB, Siddharth Seth
MR3656.txt, 12/Jan/12 21:23, 11 kB, Siddharth Seth
MR3656.txt, 11/Jan/12 23:50, 10 kB, Siddharth Seth

Issue Links

fixes: MAPREDUCE-7481 Invalid transitions for TaskAttemptImpl (Open)

People

Assignee: Siddharth Seth

Reporter: Karam Singh

Votes: 0

Watchers: 4

Dates

Created: 11/Jan/12 15:45

Updated: 16/Sep/24 10:46

Resolved: 13/Jan/12 21:33