Details

    • Type: Improvement
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.20.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The job init thread currently initializes one job at a time. However, this is a lengthy and partly IO-bound process because all of the job's block locations need to be resolved through the namenode and a map of them needs to be built. It can take tens of seconds. As a result, the cluster sometimes initializes jobs too slowly for full utilization to be achieved, if there are many small jobs queued up. It would be better to have a pool of threads that initialize multiple jobs in parallel. One thing to be careful of, however, is not causing deadlocks or holding locks for too long in these threads.
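The pool-of-threads idea proposed above can be sketched roughly as follows. This is a minimal illustration, not the code from any attached patch: `ParallelInitSketch` and `initJob` are hypothetical names, and the placeholder body stands in for the slow, partly IO-bound per-job work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: ParallelInitSketch and initJob are hypothetical, not Hadoop classes.
class ParallelInitSketch {

    // Stand-in for the slow, partly IO-bound work of initializing one job.
    static String initJob(String jobId) {
        return jobId + ":initialized";
    }

    // Submits every job to a fixed-size pool and collects results in submission order.
    static List<String> initAll(List<String> jobIds, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String jobId : jobIds) {
                pending.add(pool.submit(() -> initJob(jobId)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get()); // blocks until that job has finished initializing
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for job init", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("job init failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(initAll(Arrays.asList("job_1", "job_2", "job_3"), 2));
    }
}
```

With a pool like this, one slow job delays only the worker it occupies. Any lock shared by all workers would still serialize them, which is the caveat the description raises.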

      1. parallel-job-init-v1.patch
        5 kB
        Matei Zaharia
      2. hadoop-4664-v1.patch
        13 kB
        Jothi Padmanabhan
      3. hadoop-4664-v2.patch
        14 kB
        Jothi Padmanabhan
      4. hadoop-4664-v3.patch
        14 kB
        Jothi Padmanabhan
      5. hadoop-4664-v4.patch
        14 kB
        Jothi Padmanabhan

          Activity

          Hudson added a comment -

          Integrated in Hadoop-trunk #778 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/778/).
          Introduces multiple job initialization threads, where the number of threads is configurable via mapred.jobinit.threads. Contributed by Matei Zaharia and Jothi Padmanabhan.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12402051/hadoop-4664-v4.patch
          against trunk revision 752927.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/77/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/77/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/77/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/77/console

          This message is automatically generated.

          Devaraj Das added a comment - - edited

          I just committed this to the 0.20 branch and the trunk. Thanks Matei and Jothi!

          Jothi Padmanabhan added a comment -

          Based on an offline discussion with Devaraj, made some minor changes to the patch.

          Test-Patch results:

          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
          [exec]

          Jothi Padmanabhan added a comment -

          Patch incorporating review comments.

          Test Patch result:

          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          Devaraj Das added a comment -

          Some comments:

          1. Set the daemon attribute for the init threads.
          2. The termination of the main init thread should be fixed: the "while(true)" loop should check the interrupt status.
          3. It would be better to use the static factory method Executors.newFixedThreadPool(int) instead of constructing a new thread pool using the explicit constructor.
          4. There is no need to catch Exception in JobInitManager.run.
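The review points above (daemon workers, an interrupt-aware loop, and a pool built through the `Executors` factory) could be combined along these lines. This is a hedged sketch, not the patch's `JobInitManager`: the class `InitManagerSketch`, its queue, and the `demo` helper are assumptions made for illustration. It uses the two-argument `Executors.newFixedThreadPool(int, ThreadFactory)` overload so that the workers can be marked as daemons.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

// Hypothetical manager loop; names are illustrative, not taken from the patch.
class InitManagerSketch implements Runnable {
    private final BlockingQueue<Runnable> pendingJobs = new LinkedBlockingQueue<>();
    private final ExecutorService pool;

    InitManagerSketch(int threads) {
        // Daemon workers do not keep the JVM alive at shutdown.
        ThreadFactory daemonFactory = r -> {
            Thread t = new Thread(r, "job-init-worker");
            t.setDaemon(true);
            return t;
        };
        pool = Executors.newFixedThreadPool(threads, daemonFactory);
    }

    void enqueue(Runnable initTask) {
        pendingJobs.add(initTask);
    }

    @Override
    public void run() {
        // Check the interrupt status instead of looping on while(true).
        while (!Thread.currentThread().isInterrupted()) {
            try {
                pool.execute(pendingJobs.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag so the loop exits
            }
        }
        pool.shutdownNow();
    }

    // Runs `jobs` trivial tasks, then interrupts the manager and checks that it stops.
    static boolean demo(int jobs, int threads) {
        InitManagerSketch manager = new InitManagerSketch(threads);
        CountDownLatch done = new CountDownLatch(jobs);
        Thread managerThread = new Thread(manager, "job-init-manager");
        managerThread.setDaemon(true);
        managerThread.start();
        for (int i = 0; i < jobs; i++) {
            manager.enqueue(done::countDown);
        }
        try {
            boolean allRan = done.await(5, TimeUnit.SECONDS);
            managerThread.interrupt();
            managerThread.join(5000);
            return allRan && !managerThread.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Only `InterruptedException` is caught in `run`, in line with the last review comment: a broad `catch (Exception)` in the loop would hide failures that callers may need to see.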

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12401498/hadoop-4664-v2.patch
          against trunk revision 751463.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/31/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/31/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/31/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/31/console

          This message is automatically generated.

          Eric Yang added a comment -

          Resubmitting the patch to Hudson; the trunk test was broken by HADOOP-5409.

          Jothi Padmanabhan added a comment -

          "can you plz explain why ThreadPoolExecutor should be used?"

          With a small number of threads, there is not much difference between a ThreadPoolExecutor and explicitly managed threads. However, if and when we scale to a larger number of threads, having a ThreadPoolExecutor manage them is definitely a better option than managing them explicitly.

          Jothi Padmanabhan added a comment -

          Attaching a patch that changes JobHistory.openJobs to a ConcurrentHashMap.

          Jothi Padmanabhan added a comment -

          True. There are dfs calls in JobHistory*. However, I do not foresee a deadlock; these calls might just serialize work that could otherwise have run in parallel, no? This patch does not insulate us from slow data nodes; it just mitigates their effect.

          On the different thread-safety items:

          1. JobHistory.openJobs - Agreed. This needs to be made into a ConcurrentHashMap.
          2. dnsToSwitchMapping - The method in question is resolve. From what I can see, this method is already thread safe in all its current implementations.
          3. JobEndNotifier.queue - This is already a DelayQueue.
          4. storeCompletedJob() - CompletedJobStatusStore.store seems to be thread safe already.
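As a rough illustration of item 1, a job registry backed by `ConcurrentHashMap` stays consistent when several init threads register jobs at once. `OpenJobsSketch` is a hypothetical analogue of `JobHistory.openJobs`, not the actual Hadoop structure; the value type and method names are assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical analogue of JobHistory.openJobs: job id -> history file name.
class OpenJobsSketch {
    private final Map<String, String> openJobs = new ConcurrentHashMap<>();

    // putIfAbsent is atomic, so two threads registering the same job cannot both succeed.
    boolean register(String jobId, String historyFile) {
        return openJobs.putIfAbsent(jobId, historyFile) == null;
    }

    int size() {
        return openJobs.size();
    }

    // Registers each job id twice from a pool of threads and returns the final map size.
    static int registerConcurrently(int jobs, int threads) {
        OpenJobsSketch registry = new OpenJobsSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < jobs; i++) {
            String id = "job_" + i;
            pool.execute(() -> registry.register(id, id + ".hist"));
            pool.execute(() -> registry.register(id, id + ".hist")); // duplicate on purpose
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for registrations");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return registry.size();
    }
}
```

A plain `HashMap` under the same concurrent load has no such guarantee, which is presumably why the change to `openJobs` was agreed.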
          Amar Kamat added a comment -

          A few comments:

          1. I feel Matei's implementation is simpler and does not involve the overhead of adding an extra thread. Tom, can you plz explain why ThreadPoolExecutor should be used?
          2. In JobInProgress.initTasks(), the first dfs call is made via JobHistory.logSubmitted(). This can also block on a dfs call made in JobHistory.getJobHistoryFileName(), thus blocking all the other threads. Hence there is a corner case where all the threads will be blocked (on JobHistory). Here are the APIs which are synchronized and might block on a dfs call:
            1. JobHistory.getJobHistoryFileName() within JobHistory.logSubmitted()
            2. JobTracker.finalizeJob() and JobHistory.finalizeRecovery() within JobTracker.finalizeJob()
          3. All the APIs invoked from JobInProgress.initTasks() should be made thread safe. For example, we should document that JobTracker.resolveAndAddToTopology() should be thread safe. The following APIs should be made thread safe:
            Class | API | Structure
            JobHistory | logSubmitted() / logInited() / logFinished() / logFailed() / logJobPriority() | openJobs
            JobTracker | resolveAndAddToTopology() | dnsToSwitchMapping
            JobEndNotifier | registerNotification() | queue
            JobTracker | storeCompletedJob() | completedJobStatusStore (look at store() etc.)

          Hey, can you plz check if there are other such APIs?

          In the future we might want to associate a timer with each thread. We really don't want 3 out of 4 threads to be blocked for 1 hr on dfs operations. But for now I think it's a premature step.
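The per-thread timer mentioned as a possible future step could look something like the sketch below, which bounds an init task with `Future.get(timeout, unit)` and cancels it on expiry. This was not part of the patch; `InitTimeoutSketch`, `runWithTimeout`, and `sleeper` are hypothetical names. Note the limitation in the comments: cancellation only interrupts the worker, so a task blocked in I/O that ignores interruption would keep running.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of bounding one init task with a timeout.
class InitTimeoutSketch {

    // Returns true if the task finished in time, false if it timed out and was cancelled.
    static boolean runWithTimeout(Runnable initTask, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> result = pool.submit(initTask);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // cancel(true) interrupts the worker; it helps only if the task honors interruption.
            result.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            result.cancel(true);
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("init task failed", e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }

    // Test helper: a task that sleeps for the given time but stops when interrupted.
    static Runnable sleeper(long millis) {
        return () -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }
}
```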

          Jothi Padmanabhan added a comment -

          The failed test, TestJobHistory, is unrelated to this patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12401313/hadoop-4664-v1.patch
          against trunk revision 749919.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/44/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/44/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/44/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/44/console

          This message is automatically generated.

          Jothi Padmanabhan added a comment -

          Attaching a patch that uses ThreadPoolExecutor for thread management. It also contains a test case for parallel initialization.

          Jothi Padmanabhan added a comment -

          Matei, while the performance improvement due to parallelization is definitely something we would like (and that would need HADOOP-4372 as you have pointed out), the other motivation for this patch is that this might help mitigate the effect of HADOOP-5286, as Hemanth points out. I will get out a patch for this soon.

          Matei Zaharia added a comment -

          No, go ahead and take it forward. Please note though that in my tests it didn't help much without something like HADOOP-4372 as well.

          Hemanth Yamijala added a comment -

          As per the discussion in HADOOP-5286, we are trying to change the M/R design so that it is not affected by a single slow data node. So, marking this a blocker for Hadoop 0.20. Matei, if it is acceptable, Jothi has volunteered to take your patch forward, incorporating comments from Tom. Do you have any objections to this?

          Tom White added a comment -

          Removing patch from queue while issues are addressed.

          Tom White added a comment -

          Generally looks good. A few comments:

          1. Rather than managing threads explicitly, it might be easier to use ThreadPoolExecutor (http://java.sun.com/javase/6/docs/api/java/util/concurrent/ThreadPoolExecutor.html).
          2. Do you think you can write a test for EagerTaskInitializationListener?
          3. There are some tabs in the patch.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12394016/parallel-job-init-v1.patch
          against trunk revision 718863.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3605/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3605/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3605/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3605/console

          This message is automatically generated.

          Matei Zaharia added a comment - - edited

          In some initial testing of this patch on a jobtracker with a lot of old history files, I found that the lock in JobHistory on getJobHistoryFileName and recoverJobHistoryFile was causing most of the threads to block while one thread listed the directory, leading to no improvement. However, Amar Kamat explained that HADOOP-4372 will help solve this issue. I'll wait on that before trying to modify things myself. The patch provided here should still help when the job init phase is limited more by CPU than by the history file scanning and creation.
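The blocking behavior described above can be reproduced in miniature: when every worker must pass through one class-wide lock, at most one of them makes progress inside it, regardless of pool size. The sketch below is an assumption-laden illustration; `SharedLockSketch` and `listHistoryDir` are hypothetical stand-ins, and the short sleep merely simulates a slow directory listing.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a class whose static synchronized methods do slow I/O.
class SharedLockSketch {
    private static final AtomicInteger inside = new AtomicInteger();
    private static final AtomicInteger maxInside = new AtomicInteger();

    // All callers contend for the same class-level monitor.
    static synchronized void listHistoryDir() {
        int now = inside.incrementAndGet();
        maxInside.accumulateAndGet(now, Math::max); // record the peak number of callers inside
        try {
            Thread.sleep(5); // simulated slow directory listing
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        inside.decrementAndGet();
    }

    // Runs `calls` invocations on `threads` workers; returns the peak concurrency observed.
    static int maxConcurrentCallers(int threads, int calls) {
        inside.set(0);
        maxInside.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < calls; i++) {
            pool.execute(SharedLockSketch::listHistoryDir);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return maxInside.get();
    }
}
```

Because the method is `static synchronized`, the observed peak is 1 even with four workers. That matches the reported lack of improvement when the shared lock dominates the init time.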

          Vivek Ratan added a comment -

          Just as an FYI, we're doing something similar in the Capacity Scheduler (HADOOP-4513). We're initializing jobs asynchronously, and have a thread per queue, so jobs in different queues get initialized in parallel.

          Matei Zaharia added a comment -

          Here is a patch for this issue. The patch adds multiple job init threads in the EagerTaskInitializationListener, which is used to initialize tasks by the default scheduler (JobQueueTaskScheduler) and the fair scheduler. The capacity scheduler actually initializes jobs in its assignTasks method, which happens in an RPC handler thread, so it can already do this in parallel (although it may be worth modifying it to have a separate set of job init threads so that the RPC handlers don't block waiting for a job to initialize).

          This patch also makes the CachedDNSToSwitchMap use a ConcurrentHashMap instead of a TreeMap for its rack resolving cache to avoid errors caused by multiple writes. (Cache-hit reads require no locks with ConcurrentHashMap.) Apart from the possibility of multiple writes to the resolution cache, I think I saw no other potentially conflict-inducing operations in initTasks, but I'd really welcome a second pair of eyes to look at it.

          The number of job init threads is configurable as mapred.jobinit.threads. I set it to 4 by default, but let me know if there are any objections.
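Two ideas from the comment above, a `ConcurrentHashMap`-backed resolution cache and a thread count read from `mapred.jobinit.threads` with a default of 4, can be sketched as follows. This is not the patch code: `RackCacheSketch` and `slowResolve` are hypothetical, the rack names are made up, and `java.util.Properties` is used purely as a stand-in for Hadoop's own configuration object.

```java
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical rack-resolution cache; Properties stands in for Hadoop's configuration.
class RackCacheSketch {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final AtomicInteger slowLookups = new AtomicInteger();

    // Placeholder for an expensive host-to-rack resolution (e.g. a script or DNS call).
    private String slowResolve(String host) {
        slowLookups.incrementAndGet();
        return "/rack-" + Math.abs(host.hashCode() % 4);
    }

    String resolve(String host) {
        String rack = cache.get(host); // cache hit: no locking needed for the read
        if (rack == null) {
            rack = slowResolve(host);
            String previous = cache.putIfAbsent(host, rack); // safe under concurrent writers
            if (previous != null) {
                rack = previous; // another thread won the race; use its value
            }
        }
        return rack;
    }

    int slowLookups() {
        return slowLookups.get();
    }

    // Reads the pool size from the property named in the comment, defaulting to 4.
    static int initThreads(Properties conf) {
        return Integer.parseInt(conf.getProperty("mapred.jobinit.threads", "4"));
    }
}
```

Two threads that miss on the same host may both call `slowResolve`, but only one result is stored. That trade-off keeps reads lock-free at the cost of occasional duplicate resolution work.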


            People

            • Assignee:
              Jothi Padmanabhan
              Reporter:
              Matei Zaharia
            • Votes:
              0
              Watchers:
              7
