Hadoop Map/Reduce / MAPREDUCE-4843

When using DefaultTaskController, JobLocalizer not thread safe

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.1.1
    • Fix Version/s: 1.2.0
    • Component/s: tasktracker
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      In our cluster, jobs sometimes fail with the following exception:
      2012-12-03 23:11:54,811 WARN org.apache.hadoop.mapred.TaskTracker: Error initializing attempt_201212031626_1115_r_000023_0:
      org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/$username/jobcache/job_201212031626_1115/job.xml in any of the configured local directories
      at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:424)
      at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160)
      at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1175)
      at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1058)
      at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2213)

      The root cause is that JobLocalizer is not thread safe.
      DefaultTaskController.initializeJob creates the localizer like this:
      JobLocalizer localizer = new JobLocalizer((JobConf)getConf(), user, jobid);
      but JobLocalizer simply keeps a reference to the conf it is given.
      When two TaskLauncher threads (mapLauncher and reduceLauncher) call initializeJob at the same time, there are two JobLocalizer instances but only one conf instance.
      As a result, ttConf.setStrings(JOB_LOCAL_CTXT, localDirs) sometimes overwrites the value set for the previous job.
      The previous job's job.xml is then stored under another user's directory.
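
      The overwrite described above can be illustrated with a minimal sketch that does not depend on Hadoop. Conf, Localizer, the key name "local.ctxt" and the user names are hypothetical stand-ins for JobConf, JobLocalizer and JOB_LOCAL_CTXT; only the handling of the shared conf is modelled. The two constructor calls run one after the other here, which reproduces the unlucky interleaving deterministically.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for JobConf: a mutable key/value store with a copy constructor. */
final class Conf {
    private final Map<String, String> values = new HashMap<>();

    Conf() {}

    /** Copy constructor: later changes to the copy do not affect the original. */
    Conf(Conf other) {
        values.putAll(other.values);
    }

    void set(String key, String value) { values.put(key, value); }

    String get(String key) { return values.get(key); }
}

/** Hypothetical stand-in for JobLocalizer; only the conf handling is modelled. */
final class Localizer {
    static final String LOCAL_CTXT = "local.ctxt"; // illustrative key name, not Hadoop's

    private final Conf conf;

    Localizer(Conf ttConf, String user, boolean copyConf) {
        // copyConf=false keeps a reference to the caller's conf (the behaviour described above);
        // copyConf=true gives each localizer a private copy.
        this.conf = copyConf ? new Conf(ttConf) : ttConf;
        this.conf.set(LOCAL_CTXT, "taskTracker/" + user);
    }

    /** Directory in which this localizer would place job.xml. */
    String jobXmlDir() { return conf.get(LOCAL_CTXT) + "/jobcache"; }
}

final class SharedConfDemo {
    /** Builds two localizers from one conf, then asks the first one where it writes. */
    static String firstLocalizerDir(boolean copyConf) {
        Conf ttConf = new Conf();
        Localizer alice = new Localizer(ttConf, "alice", copyConf);
        // The second localizer is created before the first one has written job.xml.
        Localizer bob = new Localizer(ttConf, "bob", copyConf);
        return alice.jobXmlDir();
    }

    public static void main(String[] args) {
        System.out.println("shared conf:  " + firstLocalizerDir(false)); // taskTracker/bob/jobcache
        System.out.println("private copy: " + firstLocalizerDir(true));  // taskTracker/alice/jobcache
    }
}
```

      In this toy model a private copy avoids the overwrite. Whether that is sufficient inside the TaskTracker is a separate question: a later comment on this issue reports that the copy-based patch did not work.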

      1. mr-4843.patch
        3 kB
        Karthik Kambatla
      2. MAPREDUCE-4843-branch-1.1.patch
        0.9 kB
        zhaoyunjiong

        Issue Links

          Activity

          zhaoyunjiong created issue -
          zhaoyunjiong added a comment -

          The fix is very simple:

          diff --git a/src/mapred/org/apache/hadoop/mapred/JobLocalizer.java b/src/mapred/org/apache/hadoop/mapred/JobLocalizer.java
          index 0802b03..625face 100644
          --- a/src/mapred/org/apache/hadoop/mapred/JobLocalizer.java
          +++ b/src/mapred/org/apache/hadoop/mapred/JobLocalizer.java
          @@ -108,7 +108,7 @@ public class JobLocalizer {
                 throw new IOException("Cannot initialize for null jobid");
               }
               this.jobid = jobid;
          -    this.ttConf = ttConf;
          +    this.ttConf = new JobConf(ttConf);
               lfs = FileSystem.getLocal(ttConf).getRaw();
               this.localDirs = createPaths(user, localDirs);
               ttConf.setStrings(JOB_LOCAL_CTXT, localDirs);
          zhaoyunjiong made changes -
          Field: Description
          Original Value:
          In our cluster, some times job will failed due to below exception:
          Error initializing attempt_201210181806_18566_r_000376_0: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/$username/jobcache/job_201210181806_18566/job.xml in any of the configured local directories at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:424)

          The root cause is JobLocalizer is not thread safe.
          In DefaultTaskController.initializeJob method:
               JobLocalizer localizer = new JobLocalizer((JobConf)getConf(), user, jobid);
          but in JobLocalizer, it just simply keep the reference of the conf.
          When two TaskLauncher threads(mapLauncher and reduceLauncher) try to initializeJob at same time, it will have two JobLocalizer, but one conf instance.
          So some times ttConf.setStrings(JOB_LOCAL_CTXT, localDirs) will reset previous job's conf.
          It will cause the previous job's job.xml stored at another user's dir.
          New Value: the current Description, as shown above.
          zhaoyunjiong added a comment -

          The patch above does not work. I'm working on a new one.

          zhaoyunjiong added a comment -

          Updated patch.

          zhaoyunjiong made changes -
          Attachment MAPREDUCE-4843-branch-1.1.patch [ 12556081 ]
          zhaoyunjiong added a comment -

          Testing patch

          zhaoyunjiong made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12556081/MAPREDUCE-4843-branch-1.1.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3095//console

          This message is automatically generated.

          Karthik Kambatla (Inactive) added a comment -

          zhaoyunjiong, the patch looks good. Can you post a patch against trunk so that QA can apply it? Also, I was wondering if it would be possible to add a test.

          zhaoyunjiong added a comment -

          No need for trunk. In Hadoop 2.0, the problem doesn't exist.
          A thread-safety problem is very difficult to test: even when the code is not thread safe, the test will pass in most cases.
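
          One common way to make such a race testable is to force the problematic interleaving instead of waiting for it to happen by chance. The sketch below is an illustration only and is not part of either patch: it uses a plain map as a hypothetical stand-in for the shared conf, and two latches so that the second thread always writes between the first thread's write and its read. The thread names mirror the launchers mentioned in the description; the key and values are made up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

final class ForcedInterleavingTest {
    static final String KEY = "local.ctxt"; // illustrative key name, not Hadoop's

    /** Returns the value the first thread reads back after the second thread has written. */
    static String run() {
        Map<String, String> sharedConf = new ConcurrentHashMap<>();
        CountDownLatch firstWrote = new CountDownLatch(1);
        CountDownLatch secondWrote = new CountDownLatch(1);
        String[] seenByFirst = new String[1];

        Thread first = new Thread(() -> {
            sharedConf.put(KEY, "userA");
            firstWrote.countDown();          // let the second thread proceed
            try {
                secondWrote.await();         // wait until the second thread has overwritten the key
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            seenByFirst[0] = sharedConf.get(KEY);
        }, "mapLauncher");

        Thread second = new Thread(() -> {
            try {
                firstWrote.await();          // write only after the first thread has written
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            sharedConf.put(KEY, "userB");
            secondWrote.countDown();
        }, "reduceLauncher");

        first.start();
        second.start();
        try {
            first.join();                    // join makes seenByFirst visible to this thread
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker threads", e);
        }
        return seenByFirst[0];
    }

    public static void main(String[] args) {
        // With a shared map the first thread reads the second thread's value.
        System.out.println("first thread sees: " + run()); // userB
    }
}
```

          Applying this to real code requires hooks at the points where the threads must pause, which the classes under test may not offer.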

          Karthik Kambatla (Inactive) added a comment -

          My bad - read the branch name wrong. I applied the patch locally, and verified that the tests that directly use DefaultTaskController pass - TestTaskTrackerLocalization, TestJvmManager, TestTaskEnvironment.

          +1

          Karthik Kambatla (Inactive) made changes -
          Link This issue is related to MAPREDUCE-4964 [ MAPREDUCE-4964 ]
          Karthik Kambatla (Inactive) made changes -
          Assignee Karthik Kambatla [ kkambatl ]
          Karthik Kambatla (Inactive) added a comment -

          Uploading the patch from MAPREDUCE-4964 as that solves this issue in a simpler/cleaner way. The discussion on that JIRA has all the details.

          Applied the patch to latest branch-1 and it applies cleanly. Also, verified TestJobLocalizer passes.

          Karthik Kambatla (Inactive) made changes -
          Attachment mr-4843.patch [ 12567889 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12567889/mr-4843.patch
          against trunk revision .

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3298//console

          This message is automatically generated.

          Alejandro Abdelnur added a comment -

          +1. As per discussion in MAPREDUCE-4964 the latest patch seems a better way of doing it.

          Alejandro Abdelnur added a comment -

          Thanks Karthik. Committed to branch-1. Arun, thanks for double checking on this one.

          Alejandro Abdelnur made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Fix Version/s 1.2.0 [ 12321661 ]
          Resolution Fixed [ 1 ]
          Matt Foley added a comment -

          Closed upon release of Hadoop 1.2.0.

          Matt Foley made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Gavin made changes -
          Assignee Karthik Kambatla [ kkambatl ] Karthik Kambatla [ kasha ]
          Transition                   Time In Source Status   Execution Times   Last Executer        Last Execution Date
          Open -> Patch Available      19h 50m                 1                 zhaoyunjiong         05/Dec/12 09:30
          Patch Available -> Resolved  61d 12h 57m             1                 Alejandro Abdelnur   04/Feb/13 22:27
          Resolved -> Closed           99d 6h 48m              1                 Matt Foley           15/May/13 06:15

            People

            • Assignee: Karthik Kambatla
            • Reporter: zhaoyunjiong
            • Votes: 1
            • Watchers: 12
