Hadoop Map/Reduce / MAPREDUCE-3186

User jobs hang if the ResourceManager process goes down and comes back up while a job is executing.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: mrv2
    • Environment:

      linux

    • Hadoop Flags:
      Reviewed
    • Release Note:
      New YARN configuration property:

      Name: yarn.app.mapreduce.am.scheduler.connection.retries
      Description: Number of times the MapReduce ApplicationMaster (AM) retries contacting the ResourceManager (RM) after the connection is lost.
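The release note describes a bounded retry: rather than waiting indefinitely on a ResourceManager that no longer answers, the AM gives up after a configured number of attempts. A minimal Java sketch of that idea, under stated assumptions: `BoundedHeartbeat`, `Heartbeat` and `run` are hypothetical names, not Hadoop or YARN APIs; `maxRetries` stands in for the value of `yarn.app.mapreduce.am.scheduler.connection.retries`; and resetting the failure counter after a successful heartbeat is an assumed policy. This is not the code from the attached patches.

```java
import java.io.IOException;

/**
 * Hypothetical sketch (not Hadoop code): bound the number of consecutive
 * failed AM-to-RM heartbeats so the ApplicationMaster gives up instead of
 * hanging forever.
 */
public class BoundedHeartbeat {

    /** Stand-in for one AM-to-RM heartbeat call (hypothetical, not a YARN API). */
    @FunctionalInterface
    public interface Heartbeat {
        void call() throws IOException;
    }

    /**
     * Runs up to totalBeats heartbeats. Each failure increments a
     * consecutive-failure counter; any success resets it to zero.
     * Once the counter exceeds maxRetries, the loop stops by throwing.
     *
     * @return the number of heartbeats run when none of the failure streaks exceeded maxRetries
     * @throws IOException if more than maxRetries consecutive heartbeats fail
     */
    public static int run(Heartbeat hb, int maxRetries, int totalBeats) throws IOException {
        int consecutiveFailures = 0;
        for (int i = 0; i < totalBeats; i++) {
            try {
                hb.call();
                consecutiveFailures = 0; // contact restored, start counting afresh
            } catch (IOException e) {
                consecutiveFailures++;
                if (consecutiveFailures > maxRetries) {
                    // Give up so the caller can fail the job and let the AM process exit.
                    throw new IOException("Lost contact with RM after "
                            + consecutiveFailures + " consecutive failed heartbeats", e);
                }
            }
        }
        return totalBeats;
    }

    public static void main(String[] args) {
        // Simulated RM that never accepts this attempt again after its restart.
        int[] calls = {0};
        Heartbeat deadRm = () -> {
            calls[0]++;
            throw new IOException("AppAttemptId doesnt exist in cache");
        };
        try {
            run(deadRm, 3, 100);
        } catch (IOException e) {
            System.out.println(e.getMessage());
            System.out.println("heartbeat calls made: " + calls[0]);
        }
    }
}
```

With `maxRetries = 3`, the simulated dead RM in `main` is called four times (one initial attempt plus three retries) before `run` throws, so the caller can fail the job instead of leaving it shown as running. Intermittent failures shorter than the retry limit do not stop the loop.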

      Description

      If the ResourceManager is restarted while a job is executing, the job hangs.
      The UI still shows the job as running.
      The RM log shows the error "ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AppAttemptId doesnt exist in cache appattempt_1318579738195_0004_000001".
      On the console, the MRAppMaster and RunJar processes are not killed.

        Attachments

        1. MR3186_v3.txt
          12 kB
          Siddharth Seth
        2. MAPREDUCE-3186.v2.txt
          10 kB
          Eric Payne
        3. MAPREDUCE-3186.v1.txt
          5 kB
          Eric Payne

              People

              • Assignee: Eric Payne (eepayne)
              • Reporter: Ramgopal N (ramgopalnaali)
              • Votes: 0
              • Watchers: 7
