Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-4088

Task stuck in JobLocalizer prevented other tasks on the same node from committing

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 0.20.205.0
    • 1.1.0
    • mrv1
    • None

    Description

      We saw that as a result of HADOOP-6963, one task was stuck in this

      Thread 23668: (state = IN_NATIVE)

      • java.io.UnixFileSystem.getBooleanAttributes0(java.io.File) @bci=0 (Compiled frame; information may be imprecise)
      • java.io.UnixFileSystem.getBooleanAttributes(java.io.File) @bci=2, line=228 (Compiled frame)
      • java.io.File.exists() @bci=20, line=733 (Compiled frame)
      • org.apache.hadoop.fs.FileUtil.getDU(java.io.File) @bci=3, line=446 (Compiled frame)
      • org.apache.hadoop.fs.FileUtil.getDU(java.io.File) @bci=52, line=455 (Compiled frame)
      • org.apache.hadoop.fs.FileUtil.getDU(java.io.File) @bci=52, line=455 (Compiled frame)
        ....
        .... TONS MORE OF THIS SAME LINE
      • org.apache.hadoop.fs.FileUtil.getDU(java.io.File) @bci=52, line=455 (Compiled frame)
        .....
        .....
      • org.apache.hadoop.fs.FileUtil.getDU(java.io.File) @bci=52, line=455 (Compiled frame)
      • org.apache.hadoop.fs.FileUtil.getDU(java.io.File) @bci=52, line=455 (Interpreted frame)
        ne=451 (Interpreted frame)
      • org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCacheObjects(org.apache.hadoop.conf.Configuration, java.net.URI[], org.apache.hadoop.fs.Path[], long[], boolean[], boolean) @bci=150, line=324 (Interpreted frame)
      • org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCache(org.apache.hadoop.conf.Configuration) @bci=40, line=349 (Interpreted frame) 51, line=383 (Interpreted frame)
      • org.apache.hadoop.mapred.JobLocalizer.runSetup(java.lang.String, java.lang.String, org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=46, line=477 (Interpreted frame)
      • org.apache.hadoop.mapred.JobLocalizer$3.run() @bci=20, line=534 (Interpreted frame)
      • org.apache.hadoop.mapred.JobLocalizer$3.run() @bci=1, line=531 (Interpreted frame)
      • java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Interpreted frame)
      • javax.security.auth.Subject.doAs(javax.security.auth.Subject, java.security.PrivilegedExceptionAction) @bci=42, line=396 (Interpreted frame)
      • org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction) @bci=14, line=1082 (Interpreted frame)
      • org.apache.hadoop.mapred.JobLocalizer.main(java.lang.String[]) @bci=266, line=530 (Interpreted frame)

      While all other tasks on the same node were stuck in
      Thread 32141: (state = BLOCKED)

      • java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
      • org.apache.hadoop.mapred.Task.commit(org.apache.hadoop.mapred.TaskUmbilicalProtocol, org.apache.hadoop.mapred.Task$TaskReporter, org.apache.hadoop.mapreduce.OutputCommitter) @bci=24, line=980 (Compiled frame)
      • org.apache.hadoop.mapred.Task.done(org.apache.hadoop.mapred.TaskUmbilicalProtocol, org.apache.hadoop.mapred.Task$TaskReporter) @bci=146, line=871 (Interpreted frame)
      • org.apache.hadoop.mapred.ReduceTask.run(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=470, line=423 (Interpreted frame)
      • org.apache.hadoop.mapred.Child$4.run() @bci=29, line=255 (Interpreted frame)
      • java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Interpreted frame)
      • javax.security.auth.Subject.doAs(javax.security.auth.Subject, java.security.PrivilegedExceptionAction) @bci=42, line=396 (Interpreted frame)
      • org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction) @bci=14, line=1082 (Interpreted frame)
      • org.apache.hadoop.mapred.Child.main(java.lang.String[]) @bci=738, line=249 (Interpreted frame)

      This should never happen. A stuck task should never prevent other tasks from different jobs on the same node from committing.

      Attachments

        1. MAPREDUCE-4088.patch
          2 kB
          Ravi Prakash
        2. MAPREDUCE-4088.patch
          4 kB
          Ravi Prakash
        3. MAPREDUCE-4088.branch-1.patch
          4 kB
          Ravi Prakash
        4. MAPREDUCE-4088.branch-1.patch
          4 kB
          Ravi Prakash

        Issue Links

          Activity

            People

              raviprak Ravi Prakash
              raviprak Ravi Prakash
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: