  1. Hadoop Map/Reduce
  2. MAPREDUCE-3186

User jobs hang if the Resource Manager process goes down and comes back up while the job is executing.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: mrv2
    • Environment: linux

    • Reviewed
    • New YARN configuration property:

      Name: yarn.app.mapreduce.am.scheduler.connection.retries
      Description: Number of times AM should retry to contact RM if connection is lost.

    Description

      If the Resource Manager is restarted while a job is executing, the job hangs.
      The UI shows the job as running.
      The RM log reports the error "ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AppAttemptId doesnt exist in cache appattempt_1318579738195_0004_000001"
      On the console, the MRAppMaster and RunJar processes are not killed.
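
      The release note describes the new property only as the number of times the AM should retry contacting the RM after the connection is lost. The following is a minimal, self-contained sketch of what a bounded retry of that kind could look like. It is not the Hadoop implementation: the class BoundedHeartbeat, its State enum, and the choice to mark the job FAILED once retries are exhausted are all hypothetical, and maxRetries merely stands in for yarn.app.mapreduce.am.scheduler.connection.retries.

```java
import java.util.function.BooleanSupplier;

/**
 * Hypothetical sketch (not Hadoop code): an application master heartbeat
 * that gives up after a bounded number of consecutive failed contacts
 * with the resource manager, instead of looping forever.
 */
public class BoundedHeartbeat {
    /** Simplified job state as seen by this sketch. */
    public enum State { RUNNING, FAILED }

    // Stands in for yarn.app.mapreduce.am.scheduler.connection.retries (assumption).
    private final int maxRetries;
    private int consecutiveFailures = 0;
    private State state = State.RUNNING;

    public BoundedHeartbeat(int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        this.maxRetries = maxRetries;
    }

    /**
     * Performs one heartbeat. The supplier returns true if the RM was
     * contacted successfully and false otherwise.
     */
    public State heartbeat(BooleanSupplier contactRm) {
        if (state == State.FAILED) {
            return state; // already gave up; do not contact the RM again
        }
        if (contactRm.getAsBoolean()) {
            consecutiveFailures = 0; // a successful contact resets the counter
        } else if (++consecutiveFailures > maxRetries) {
            // The initial attempt plus maxRetries retries have all failed.
            state = State.FAILED;
        }
        return state;
    }

    public static void main(String[] args) {
        BoundedHeartbeat hb = new BoundedHeartbeat(2);
        BooleanSupplier rmDown = () -> false; // simulates an unreachable RM
        for (int i = 1; i <= 4; i++) {
            System.out.println("heartbeat " + i + ": " + hb.heartbeat(rmDown));
        }
    }
}
```

      In this sketch the counter tracks consecutive failures only, so a single successful contact restores the full retry budget. Whether the actual fix counts failures this way is not stated in the issue.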

      Attachments

        1. MAPREDUCE-3186.v1.txt
          5 kB
          Eric Payne
        2. MAPREDUCE-3186.v2.txt
          10 kB
          Eric Payne
        3. MR3186_v3.txt
          12 kB
          Siddharth Seth

            People

              Assignee: Eric Payne (epayne)
              Reporter: Ramgopal N (ramgopalnaali)
              Votes: 0
              Watchers: 7