With the code checked out over the last two days:
A Sort job at 350-node scale, with 16800 maps and 680 reduces, has failed consistently for roughly the last 6 runs.
When around 50% of the maps have completed, the job suddenly jumps to the failed state.
Looking at the NM log, I found that the RM sent a Stop Container request to the NM for the AM container.
But at INFO level, the RM log does not show why the RM is killing the AM when the job was not killed manually.
One thing the failed AM logs have in common is:
With differing details.
For example, one log says:
Whereas other logs say: