Hadoop Map/Reduce
  1. Hadoop Map/Reduce
  2. MAPREDUCE-1354

Incremental enhancements to the JobTracker for better scalability

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.22.0
    • Component/s: jobtracker
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Incremental enhancements to the JobTracker include a no-lock version of JT.getTaskCompletion events, no lock on the JT while doing i/o during job-submission and several fixes to cut down configuration parsing during heartbeat-handling.

      Description

      It'd be nice to have the JobTracker object not be locked while accessing the HDFS for reading the jobconf file and while writing the jobinfo file in the submitJob method. We should see if we can avoid taking the lock altogether.

      1. MAPREDUCE-1354_yhadoop20.patch
        5 kB
        Arun C Murthy
      2. MAPREDUCE-1354_yhadoop20.patch
        15 kB
        Arun C Murthy
      3. MAPREDUCE-1354_yhadoop20.patch
        16 kB
        Arun C Murthy
      4. MAPREDUCE-1354_yhadoop20.patch
        19 kB
        Arun C Murthy
      5. MAPREDUCE-1354_yhadoop20.patch
        21 kB
        Arun C Murthy
      6. MAPREDUCE-1354_yhadoop20.patch
        23 kB
        Arun C Murthy
      7. MAPREDUCE-1354_yhadoop20.patch
        24 kB
        Arun C Murthy
      8. mr-1354-y20.patch
        23 kB
        Hemanth Yamijala
      9. mapreduce-1354--2010-03-10.patch
        26 kB
        Dick King
      10. mapreduce-1354--2010-05-13.patch
        26 kB
        Dick King

        Activity

        Hide
        Amareshwari Sriramadasu added a comment -

        The changes suggested above seem good to me, but since the changes are orthogonal to the code of this JIRA I would rather make them a separate change.

        Raised MAPREDUCE-1825 for this.

        Show
        Amareshwari Sriramadasu added a comment - The changes suggested above seem good to me, but since the changes are orthogonal to the code of this JIRA I would rather make them a separate change. Raised MAPREDUCE-1825 for this.
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #326 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/326/)

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #326 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/326/ )
        Hide
        Arun C Murthy added a comment -

        I just committed this. Thanks Dick!

        Show
        Arun C Murthy added a comment - I just committed this. Thanks Dick!
        Hide
        Dmytro Molkov added a comment -

        Is there any particular reason that only getTaskCompletionEvents dropped the synchronized modifier, but all other job access methods like getCleanupTaskReports, getSetupTaskReports, etc are still syncrhonized, while effectively they are doing a very similar kind of access?

        Show
        Dmytro Molkov added a comment - Is there any particular reason that only getTaskCompletionEvents dropped the synchronized modifier, but all other job access methods like getCleanupTaskReports, getSetupTaskReports, etc are still syncrhonized, while effectively they are doing a very similar kind of access?
        Hide
        Dick King added a comment -

        The regression failure flagged by Hudson, TestJobStatusPersistency , does not repeat, and is hugely unlikely to have been caused by this patch.

        There is no new test because this patch fixes an extremely narrow race condition and that race cannot be induced artificially.

        Show
        Dick King added a comment - The regression failure flagged by Hudson, TestJobStatusPersistency , does not repeat, and is hugely unlikely to have been caused by this patch. There is no new test because this patch fixes an extremely narrow race condition and that race cannot be induced artificially.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12444432/mapreduce-1354--2010-05-13.patch
        against trunk revision 944082.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/187/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/187/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/187/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/187/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12444432/mapreduce-1354--2010-05-13.patch against trunk revision 944082. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/187/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/187/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/187/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/187/console This message is automatically generated.
        Hide
        Amareshwari Sriramadasu added a comment -

        Patch looks fine.
        Canceling patch to submit for hudson, as trunk compiles now.

        Show
        Amareshwari Sriramadasu added a comment - Patch looks fine. Canceling patch to submit for hudson, as trunk compiles now.
        Hide
        Dick King added a comment -

        I honored the last two comments by Amareshwari [and ignored the one he invited me to ginore] and this is the patch, but as I write this, trunk does not compile, so I'm not resubmitting this patch just yet.

        Rather than taking the Big Lock, I chose to turn nextJobId into an AtomicInteger .

        I agree that the ugi == null test is dead.

        When trunk comes to build I'll test this patch and Submit it.

        -dk

        Show
        Dick King added a comment - I honored the last two comments by Amareshwari [and ignored the one he invited me to ginore] and this is the patch, but as I write this, trunk does not compile, so I'm not resubmitting this patch just yet. Rather than taking the Big Lock, I chose to turn nextJobId into an AtomicInteger . I agree that the ugi == null test is dead. When trunk comes to build I'll test this patch and Submit it. -dk
        Hide
        Amareshwari Sriramadasu added a comment -

        Some comments on the patch:

        • patch would need changes in TaskDataView.java for corresponding computeNumSlotsPerReduce changes in 0.20 patch. But the same changes are present in MAPREDUCE-1533. So, this comment can be ignored as well.
        • Any reason for making JobTracker.getNewJobID() un-synchrozied? I see deprecated getNewJobId() still synchronized.
        • In JobTracker.submitJob(), the following code change
          +      if (ugi == null) {
          +       ugi = UserGroupInformation.getCurrentUser();
          +      }
          

          is not needed. ugi would never be null.

        Other changes look fine to me.

        Show
        Amareshwari Sriramadasu added a comment - Some comments on the patch: patch would need changes in TaskDataView.java for corresponding computeNumSlotsPerReduce changes in 0.20 patch. But the same changes are present in MAPREDUCE-1533 . So, this comment can be ignored as well. Any reason for making JobTracker.getNewJobID() un-synchrozied? I see deprecated getNewJobId() still synchronized. In JobTracker.submitJob(), the following code change + if (ugi == null ) { + ugi = UserGroupInformation.getCurrentUser(); + } is not needed. ugi would never be null. Other changes look fine to me.
        Hide
        Dick King added a comment -

        The changes suggested above seem good to me, but since the changes are orthogonal to the code of this JIRA I would rather make them a separate change.

        Show
        Dick King added a comment - The changes suggested above seem good to me, but since the changes are orthogonal to the code of this JIRA I would rather make them a separate change.
        Hide
        Amareshwari Sriramadasu added a comment -

        Shall we do the optimizations suggested at comment as part of the trunk patch?
        Or raise a jira to do the same?

        Show
        Amareshwari Sriramadasu added a comment - Shall we do the optimizations suggested at comment as part of the trunk patch? Or raise a jira to do the same?
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12439506/mapreduce-1354--2010-03-10.patch
        against trunk revision 925561.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/47/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/47/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/47/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/47/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12439506/mapreduce-1354--2010-03-10.patch against trunk revision 925561. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/47/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/47/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/47/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/47/console This message is automatically generated.
        Hide
        Dick King added a comment -

        The submitted patch is designed to apply to Trunk.

        The 2/25 patch applies to Release 0.20

        -dk

        Show
        Dick King added a comment - The submitted patch is designed to apply to Trunk. The 2/25 patch applies to Release 0.20 -dk
        Hide
        Dick King added a comment -

        After studying this to see if it looked race-free, I checked this with an extensive gridmix run.

        Show
        Dick King added a comment - After studying this to see if it looked race-free, I checked this with an extensive gridmix run.
        Hide
        Hemanth Yamijala added a comment -

        More updated patch for earlier version of hadoop. Not for commit here.

        Show
        Hemanth Yamijala added a comment - More updated patch for earlier version of hadoop. Not for commit here.
        Hide
        Arun C Murthy added a comment -

        Subsuming MAPREDUCE-1495 by incorporating it.

        Show
        Arun C Murthy added a comment - Subsuming MAPREDUCE-1495 by incorporating it.
        Hide
        Hemanth Yamijala added a comment -

        I went over the capacity scheduler changes and changes in JobConf. I also went over the JT changes a little bit (more thorough review was done by Amareshwari). The changes look fine to me as far as I can see.

        Show
        Hemanth Yamijala added a comment - I went over the capacity scheduler changes and changes in JobConf. I also went over the JT changes a little bit (more thorough review was done by Amareshwari). The changes look fine to me as far as I can see.
        Hide
        Arun C Murthy added a comment -

        Some discussion has been happening here: MAPREDUCE-1495. I'm going to merge these issues since they affect the same code on similar lines.

        Show
        Arun C Murthy added a comment - Some discussion has been happening here: MAPREDUCE-1495 . I'm going to merge these issues since they affect the same code on similar lines.
        Hide
        Arun C Murthy added a comment -

        As per Rajesh's suggestion, I've fixed JT.getTaskCompletionEvents to be completely lock-free on the JT by making JT.jobs a synchronized map. However, I've left JT.(running|completed|failed)Jobs broken as currently to cut down risk.

        Show
        Arun C Murthy added a comment - As per Rajesh's suggestion, I've fixed JT.getTaskCompletionEvents to be completely lock-free on the JT by making JT.jobs a synchronized map. However, I've left JT.(running|completed|failed)Jobs broken as currently to cut down risk.
        Hide
        Amareshwari Sriramadasu added a comment -

        Other than the above optimizations I suggested in my previous comment, the code changes look fine.
        I verified all the accesses to JobInProgress.conf and JobInProgress.getJobConf(); verified that synchronized methods in JobInProgress are not called if the job is not initialized;
        JobInProgress constructor calls jobtracker.getJobTrackerMachine(), jobtracker.getInfoPort(), jobtracker.getSystemDirectoryForJob(), jobtracker.getNumTaskCacheLevels(). All these are getters and their values are initialized in JobTracker's constructor. So, calling them without JobTracker lock is fine. JobInProgress constructor calls jobtracker.getInstrumentation().addPrepJob() also. The method addPrepJob() is a synchronized method in implementation of Instrumentation. Thus, calling JobInProgress constructor withour JobTracker lock is fine.

        Show
        Amareshwari Sriramadasu added a comment - Other than the above optimizations I suggested in my previous comment, the code changes look fine. I verified all the accesses to JobInProgress.conf and JobInProgress.getJobConf(); verified that synchronized methods in JobInProgress are not called if the job is not initialized; JobInProgress constructor calls jobtracker.getJobTrackerMachine(), jobtracker.getInfoPort(), jobtracker.getSystemDirectoryForJob(), jobtracker.getNumTaskCacheLevels(). All these are getters and their values are initialized in JobTracker's constructor. So, calling them without JobTracker lock is fine. JobInProgress constructor calls jobtracker.getInstrumentation().addPrepJob() also. The method addPrepJob() is a synchronized method in implementation of Instrumentation. Thus, calling JobInProgress constructor withour JobTracker lock is fine.
        Hide
        Rajesh Balamohan added a comment -

        Plz ignore the previous comment. Had a discussion with Hemanth and will try out synchronized(jobs) on trunk.

        Show
        Rajesh Balamohan added a comment - Plz ignore the previous comment. Had a discussion with Hemanth and will try out synchronized(jobs) on trunk.
        Hide
        Hemanth Yamijala added a comment -

        It has to be on "synchronized (jobs) {"

        Sigh ! Access to the jobs data structure is completely messed up. I think for this patch we should retain the inconsistency, but reduce the scope of its damage - which I suppose is what Arun has done.

        Show
        Hemanth Yamijala added a comment - It has to be on "synchronized (jobs) {" Sigh ! Access to the jobs data structure is completely messed up. I think for this patch we should retain the inconsistency, but reduce the scope of its damage - which I suppose is what Arun has done.
        Hide
        Amareshwari Sriramadasu added a comment -

        Some more optimizations that can be done(though these do not effect scheduling code path):

        • JobInProgress.getJobConf().getUser() is called from JobTracker.submitJob() and QueueManager.hasAccess(). Those calls can also be changed to job.getUser().
        • JobInProgress.finishedMaps() and finishedReduces() are synchronized; they are called from jobqueue_details.jsp which iterates through all jobs. If any job is in initialization, this page doesn't come up until the initialization finishes.
        • JobTracker.setJobPriority() should also check whether the job is initialized or not.
        Show
        Amareshwari Sriramadasu added a comment - Some more optimizations that can be done(though these do not effect scheduling code path): JobInProgress.getJobConf().getUser() is called from JobTracker.submitJob() and QueueManager.hasAccess(). Those calls can also be changed to job.getUser(). JobInProgress.finishedMaps() and finishedReduces() are synchronized; they are called from jobqueue_details.jsp which iterates through all jobs. If any job is in initialization, this page doesn't come up until the initialization finishes. JobTracker.setJobPriority() should also check whether the job is initialized or not.
        Hide
        Rajesh Balamohan added a comment -

        In the latest patch, getTaskCompletionEvents is using synchronized(this) I believe.

        It has to be on "synchronized (jobs) {"

        public TaskCompletionEvent[] getTaskCompletionEvents(
        JobID jobid, int fromEventId, int maxEvents) throws IOException{

        JobInProgress job = null;
        synchronized (jobs)

        { job = this.jobs.get(jobid); }

        if (null != job)

        { return isJobInited(job) ? job.getTaskCompletionEvents(fromEventId, maxEvents) : TaskCompletionEvent.EMPTY_ARRAY; }

        return completedJobStatusStore.readJobTaskCompletionEvents(jobid, fromEventId, maxEvents);
        }

        Show
        Rajesh Balamohan added a comment - In the latest patch, getTaskCompletionEvents is using synchronized(this) I believe. It has to be on "synchronized (jobs) {" public TaskCompletionEvent[] getTaskCompletionEvents( JobID jobid, int fromEventId, int maxEvents) throws IOException{ JobInProgress job = null; synchronized (jobs) { job = this.jobs.get(jobid); } if (null != job) { return isJobInited(job) ? job.getTaskCompletionEvents(fromEventId, maxEvents) : TaskCompletionEvent.EMPTY_ARRAY; } return completedJobStatusStore.readJobTaskCompletionEvents(jobid, fromEventId, maxEvents); }
        Hide
        Arun C Murthy added a comment -

        Updated patch to incorporate MAPREDUCE-1495 alongwith fixing pretty much all uses of JobConf inside the scheduling code in JIP and the CS.

        Show
        Arun C Murthy added a comment - Updated patch to incorporate MAPREDUCE-1495 alongwith fixing pretty much all uses of JobConf inside the scheduling code in JIP and the CS.
        Hide
        Amareshwari Sriramadasu added a comment -

        Some comments on the patch:
        1. JobInProgress constructor calls methods like JobTracker.getSystemDirectoryForJob(). This method is called with JobTracker lock sometimes, JobTracker lock followed by JobInProgress lock sometimes and without any lock in this case. I think this should not effect any, but we should verify all the locking orders for all the back calls from JobInProgress constructor to JobTracker.
        2. Unused import for org.apache.tools.ant.taskdefs.condition.HasMethod in JobInProgress
        3. Variable JobTracker.EMPTY_TASK_DIAGNOSTICS is not used anywhere.

        Show
        Amareshwari Sriramadasu added a comment - Some comments on the patch: 1. JobInProgress constructor calls methods like JobTracker.getSystemDirectoryForJob(). This method is called with JobTracker lock sometimes, JobTracker lock followed by JobInProgress lock sometimes and without any lock in this case. I think this should not effect any, but we should verify all the locking orders for all the back calls from JobInProgress constructor to JobTracker. 2. Unused import for org.apache.tools.ant.taskdefs.condition.HasMethod in JobInProgress 3. Variable JobTracker.EMPTY_TASK_DIAGNOSTICS is not used anywhere.
        Hide
        Arun C Murthy added a comment -

        Updated, fixed the framework to use the existing hasSpeculative

        {Maps|Reduces}

        rather than fetch it from the JobConf each time.

        Show
        Arun C Murthy added a comment - Updated, fixed the framework to use the existing hasSpeculative {Maps|Reduces} rather than fetch it from the JobConf each time.
        Hide
        Arun C Murthy added a comment -

        Updated patch to cache JobConf.getMemoryFor

        {Map|Reduce}

        Task.

        Show
        Arun C Murthy added a comment - Updated patch to cache JobConf.getMemoryFor {Map|Reduce} Task.
        Hide
        Arun C Murthy added a comment -

        Updated patch to fix Math.ceil via integer division in the CapacityScheduler.

        Show
        Arun C Murthy added a comment - Updated patch to fix Math.ceil via integer division in the CapacityScheduler.
        Hide
        Todd Lipcon added a comment -

        Arun: I don't see the ceil to rint changes in this patch. Did you upload the right one?

        Show
        Todd Lipcon added a comment - Arun: I don't see the ceil to rint changes in this patch. Did you upload the right one?
        Hide
        Arun C Murthy added a comment -

        Updated patch.

        Show
        Arun C Murthy added a comment - Updated patch.
        Hide
        Arun C Murthy added a comment -

        Some more things which came up during more investigations: Math.ceil is fairly expensive, we use it in 2 places. One we can do away by using the cached value of JobInProgress.slotsPer

        {Map|Reduce}

        , the other with using rint (thanks to Chris for the suggestion).

        Show
        Arun C Murthy added a comment - Some more things which came up during more investigations: Math.ceil is fairly expensive, we use it in 2 places. One we can do away by using the cached value of JobInProgress.slotsPer {Map|Reduce} , the other with using rint (thanks to Chris for the suggestion).
        Hide
        Amar Kamat added a comment -

        Job initialization (job.split localization) can also take up considerable amount of time. Hence we should avoid access to any getter calls to JobInProgress while the initialization is in progress. Following are the other methods that first lock the JobTracker and then JobInProgress potentially locking up the JobTracker during the job initialization.

        • getMapTaskReports()
        • getReduceTaskReports()
        • getCleanupTaskReports()
        • getSetupTaskReports()
        • getTaskCompletionEvents()
        • getTaskDiagnostics()
        • setJobPriority()
        Show
        Amar Kamat added a comment - Job initialization (job.split localization) can also take up considerable amount of time. Hence we should avoid access to any getter calls to JobInProgress while the initialization is in progress. Following are the other methods that first lock the JobTracker and then JobInProgress potentially locking up the JobTracker during the job initialization. getMapTaskReports() getReduceTaskReports() getCleanupTaskReports() getSetupTaskReports() getTaskCompletionEvents() getTaskDiagnostics() setJobPriority()
        Hide
        Hemanth Yamijala added a comment -

        One thing that was noticed was that the getCounters call in JobInProgress is synchronized. The wrapper call to getCounters in Jobtracker acquires a lock on the JT and then calls JobInProgress.getCounters. The problem is that if the job is being initialized under initTasks, then the jobtracker lock can get held up. We saw an instance of this on our clusters. To avoid this case, one solution could be to check if the job being queried is inited. This pattern is used in getTaskCompletionEvents.

        Show
        Hemanth Yamijala added a comment - One thing that was noticed was that the getCounters call in JobInProgress is synchronized. The wrapper call to getCounters in Jobtracker acquires a lock on the JT and then calls JobInProgress.getCounters. The problem is that if the job is being initialized under initTasks, then the jobtracker lock can get held up. We saw an instance of this on our clusters. To avoid this case, one solution could be to check if the job being queried is inited. This pattern is used in getTaskCompletionEvents.
        Hide
        Arun C Murthy added a comment -

        Illustrative patch for yhadoop20, not to be committed.

        Show
        Arun C Murthy added a comment - Illustrative patch for yhadoop20, not to be committed.
        Hide
        Arun C Murthy added a comment -

        JobTracker.submitJob also forks a DU and writes to it's local-disk while holding the lock.

        Show
        Arun C Murthy added a comment - JobTracker.submitJob also forks a DU and writes to it's local-disk while holding the lock.
        Hide
        Devaraj Das added a comment -

        We should see if we can avoid taking the lock altogether.

        This may require major changes.. Maybe we should just make the locking more granular within the method for the time being.

        Show
        Devaraj Das added a comment - We should see if we can avoid taking the lock altogether. This may require major changes.. Maybe we should just make the locking more granular within the method for the time being.

          People

          • Assignee:
            Dick King
            Reporter:
            Devaraj Das
          • Votes:
            0 Vote for this issue
            Watchers:
            12 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development