  1. Hadoop Map/Reduce
  2. MAPREDUCE-3656

Sort job on 350 scale is consistently failing with latest MRV2 code

Details

    • Reviewed
    • Fixed a race condition in the MR AM that was causing the sort benchmark to fail consistently.

    Description

      Using code checked out over the last two days,
      the Sort job at 350-node scale, with 16800 maps and 680 reduces, has failed consistently for roughly the last 6 runs.
      When around 50% of the maps have completed, the job suddenly jumps to the failed state.
      The NM log shows that the RM sent a Stop Container request to the NM for the AM container.
      However, the RM log at INFO level does not show why the RM is killing the AM when the job was not killed manually.
      One thing common to the failed AM logs is:
      org.apache.hadoop.yarn.state.InvalidStateTransitonException
      but with different events and states.
      For example, one log says:

      org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: TA_UPDATE at ASSIGNED 
      

      whereas another log says:

      org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: JOB_COUNTER_UPDATE at ERROR
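
      As a rough illustration of how messages of this shape can arise, the sketch below models a table-driven state machine that rejects any event with no registered transition from the current state. It is not Hadoop's state machine code: the class InvalidTransitionSketch, its Machine and InvalidTransitionException types, and the TA_LAUNCHED event are hypothetical names chosen for the example, and the transition table is an assumption. Only the ASSIGNED state and the TA_UPDATE event are taken from the log above. The ordering shown (an update arriving before the launch event) is one possible way to hit such an error, not a confirmed diagnosis of this issue.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of a table-driven state machine; NOT Hadoop's implementation.
class InvalidTransitionSketch {
    // ASSIGNED and TA_UPDATE appear in the log; the other names are illustrative.
    enum State { ASSIGNED, RUNNING }
    enum Event { TA_LAUNCHED, TA_UPDATE }

    // Stand-in for the exception seen in the AM logs (hypothetical type).
    static class InvalidTransitionException extends RuntimeException {
        InvalidTransitionException(State state, Event event) {
            super("Invalid event: " + event + " at " + state);
        }
    }

    static class Machine {
        // Transition table: current state -> (event -> next state).
        private final Map<State, Map<Event, State>> table =
                new EnumMap<State, Map<Event, State>>(State.class);
        private State current;

        Machine(State initial) {
            this.current = initial;
        }

        Machine addTransition(State from, Event on, State to) {
            Map<Event, State> row = table.get(from);
            if (row == null) {
                row = new EnumMap<Event, State>(Event.class);
                table.put(from, row);
            }
            row.put(on, to);
            return this;
        }

        // Applies the event, or throws if no transition is registered for it
        // in the current state. The state is left unchanged on failure.
        synchronized State handle(Event event) {
            Map<Event, State> row = table.get(current);
            State next = (row == null) ? null : row.get(event);
            if (next == null) {
                throw new InvalidTransitionException(current, event);
            }
            current = next;
            return current;
        }

        synchronized State getState() {
            return current;
        }
    }

    // Assumed table: updates are only legal once the attempt is RUNNING.
    static Machine newTaskAttemptMachine() {
        return new Machine(State.ASSIGNED)
                .addTransition(State.ASSIGNED, Event.TA_LAUNCHED, State.RUNNING)
                .addTransition(State.RUNNING, Event.TA_UPDATE, State.RUNNING);
    }

    public static void main(String[] args) {
        Machine m = newTaskAttemptMachine();
        try {
            m.handle(Event.TA_UPDATE); // update delivered before the launch event
        } catch (InvalidTransitionException e) {
            System.out.println(e.getMessage());
        }
        m.handle(Event.TA_LAUNCHED);
        m.handle(Event.TA_UPDATE); // accepted once the machine is RUNNING
        System.out.println(m.getState());
    }
}
```

      In this sketch the same TA_UPDATE event is rejected or accepted depending only on the order in which events reach the machine, which is why such errors tend to surface intermittently and at scale.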
      

      Attachments

        1. MR3656.txt
          10 kB
          Siddharth Seth
        2. MR3656.txt
          11 kB
          Siddharth Seth
        3. MR3656.txt
          11 kB
          Siddharth Seth

          People

            sseth Siddharth Seth
            karams Karam Singh