Hadoop Map/Reduce / MAPREDUCE-3656

Sort job at 350-node scale is consistently failing with the latest MRv2 code

    Details

    • Hadoop Flags:
      Reviewed
    • Release Note:
      Fixed a race condition in the MR AM that was consistently failing the sort benchmark.

      Description

      Using the code checked out over the last two days,
      a Sort job at 350-node scale with 16800 maps and 680 reduces has failed consistently for roughly the last 6 runs.
      When around 50% of the maps have completed, the job suddenly jumps to the failed state.
      The NM log shows that the RM sent a Stop Container request to the NM for the AM container.
      However, at INFO level the RM log does not show why the RM is killing the AM when the job was not killed manually.
      One thing the failed AM logs have in common is:
      org.apache.hadoop.yarn.state.InvalidStateTransitonException
      with different events and states.
      For example, one log says:

      org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: TA_UPDATE at ASSIGNED 
      

      whereas another log says:

      org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_COUNTER_UPDATE at ERROR
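
The two messages above have the shape "Invalid event: <event> at <state>", which is what a table-driven state machine reports when an event arrives in a state that has no transition registered for it. The report does not say which ordering of events triggered the race, so the following is only a minimal sketch of that mechanism. It does not use the Hadoop YARN state machine classes; TinyStateMachine, its State and Event enums, the transition table, and InvalidTransitionException are all hypothetical names made up for illustration, with only the event and state names TA_UPDATE, ASSIGNED, and ERROR taken from the logs.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a task-attempt state machine.
// This is NOT the Hadoop implementation; the transition table is invented.
class TinyStateMachine {
    enum State { NEW, ASSIGNED, RUNNING, ERROR }
    enum Event { TA_ASSIGNED, TA_LAUNCHED, TA_UPDATE }

    // Thrown when the current state has no transition for the incoming event.
    static class InvalidTransitionException extends RuntimeException {
        InvalidTransitionException(State state, Event event) {
            super("Invalid event: " + event + " at " + state);
        }
    }

    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.NEW;

    TinyStateMachine() {
        // Assumed transitions: status updates are only legal once RUNNING.
        addTransition(State.NEW, Event.TA_ASSIGNED, State.ASSIGNED);
        addTransition(State.ASSIGNED, Event.TA_LAUNCHED, State.RUNNING);
        addTransition(State.RUNNING, Event.TA_UPDATE, State.RUNNING);
    }

    private void addTransition(State from, Event event, State to) {
        table.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class))
             .put(event, to);
    }

    // Applies one event; rejects it if no transition is registered.
    synchronized State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new InvalidTransitionException(current, event);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        TinyStateMachine sm = new TinyStateMachine();
        sm.handle(Event.TA_ASSIGNED);
        try {
            // The update is delivered before TA_LAUNCHED has been processed.
            sm.handle(Event.TA_UPDATE);
        } catch (InvalidTransitionException e) {
            System.out.println(e.getMessage()); // Invalid event: TA_UPDATE at ASSIGNED
        }
    }
}
```

In this sketch an event delivered out of order is rejected and the state is left unchanged. Whether the real AM reacts to such an exception by failing the job is not something this example shows.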
      
      1. MR3656.txt
        11 kB
        Siddharth Seth
      2. MR3656.txt
        11 kB
        Siddharth Seth
      3. MR3656.txt
        10 kB
        Siddharth Seth

          People

          • Assignee:
            Siddharth Seth
            Reporter:
            Karam Singh