Issue Details (XML | Word | Printable)

Key: HADOOP-4664
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Blocker Blocker
Assignee: Jothi Padmanabhan
Reporter: Matei Zaharia
Votes: 0
Watchers: 8
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

Parallelize job initialization

Created: 15/Nov/08 04:45 AM   Updated: 08/Jul/09 04:53 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: 0.20.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works hadoop-4664-v1.patch 2009-03-03 02:10 PM Jothi Padmanabhan 13 kB
Text File Licensed for inclusion in ASF works hadoop-4664-v2.patch 2009-03-05 07:22 AM Jothi Padmanabhan 14 kB
Text File Licensed for inclusion in ASF works hadoop-4664-v3.patch 2009-03-12 12:58 PM Jothi Padmanabhan 14 kB
Text File Licensed for inclusion in ASF works hadoop-4664-v4.patch 2009-03-12 03:28 PM Jothi Padmanabhan 14 kB
Text File Licensed for inclusion in ASF works parallel-job-init-v1.patch 2008-11-16 07:21 PM Matei Zaharia 5 kB
Issue Links:
Incorporates
 

Hadoop Flags: Reviewed
Resolution Date: 12/Mar/09 04:59 PM


 Description  « Hide
The job init thread currently initializes one job at a time. However, this is a lengthy and partly IO-bound process because all of the job's block locations need to be resolved through the namenode and a map of them needs to be built. It can take tens of seconds. As a result, the cluster sometimes initializes jobs too slowly for full utilization to be achieved, if there are many small jobs queued up. It would be better to have a pool of threads that initialize multiple jobs in parallel. One thing to be careful of, however, is not causing deadlocks or holding locks for too long in these threads.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Matei Zaharia added a comment - 16/Nov/08 07:21 PM
Here is a patch for this issue. The patch adds multiple job init threads in the EagerTaskInitializationListener, which is used to initialize tasks by the default scheduler (JobQueueTaskScheduler) and the fair scheduler. The capacity scheduler actually initializes jobs in its assignTasks method, which happens in an RPC handler thread, so it can already do this in parallel (although it may be worth modifying it to have a separate set of job init threads so that the RPC handlers don't block waiting for a job to initialize).

This patch also makes the CachedDNSToSwitchMap use a ConcurrentHashMap instead of a TreeMap for its rack resolving cache to avoid errors caused by multiple writes. (Cache-hit reads require no locks with ConcurrentHashMap.) Apart from the possibility of multiple writes to the resolution cache, I think I saw no other potentially conflict-inducing operations in initTasks, but I'd really welcome a second pair of eyes to look at it.

The number of job init threads is configurable as mapred.jobinit.threads. I set it to 4 by default, but let me know if there are any objections.


Vivek Ratan added a comment - 17/Nov/08 03:59 AM
Just as an FYI, we're doing something similar in the Capacity Scheduler (HADOOP-4513). We're initializing jobs asynchronously, and have a thread per queue, so jobs in different queues get initialized in parallel.

Matei Zaharia added a comment - 17/Nov/08 06:12 AM - edited
In some initial testing of this patch on a jobtracker with a lot of old history files, I found that the lock in JobHistory on getJobHistoryFileName and recoverJobHistoryFile was causing most of the threads to block while one thread listed the directory, leading to no improvement. However, Amar Kamat explained that HADOOP-4372 will help solve this issue. I'll wait on that before trying to modify things myself. The patch provided here should still help when the job init phase is limited more by CPU than by the history file scanning and creation.

Hadoop QA added a comment - 20/Nov/08 01:34 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12394016/parallel-job-init-v1.patch
against trunk revision 718863.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

-1 javadoc. The javadoc tool appears to have generated 1 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3605/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3605/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3605/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3605/console

This message is automatically generated.


Tom White added a comment - 09/Feb/09 11:40 AM
Generally looks good. A few comments:

Tom White added a comment - 09/Feb/09 11:40 AM
Removing patch from queue while issues are addressed.

Hemanth Yamijala added a comment - 27/Feb/09 01:44 PM
As per discussion in HADOOP-5286, we are trying to change the M/R design to not be affected by a single slow data node. So, marking this a blocker for Hadoop 0.20. Matei, if it is acceptable, Jothi has volunteered to take your patch forward, incorporating comments from Tom. Do you have any objections to this ?

Matei Zaharia added a comment - 27/Feb/09 06:32 PM
No, go ahead and take it forward. Please note though that in my tests it didn't help much without something like HADOOP-4372 as well.

Jothi Padmanabhan added a comment - 02/Mar/09 02:59 PM
Matei, while the performance improvement due to parallelization is definitely something we would like (and that would need HADOOP-4372 as you have pointed out), the other motivation for this patch is that this might help mitigate the effect of HADOOP-5286, as Hemanth points out. I will get out a patch for this soon.

Jothi Padmanabhan added a comment - 03/Mar/09 02:10 PM
Attaching a patch that uses ThreadPoolExecutor for thread management. It also contains a Test case to test parallel initialization.

Hadoop QA added a comment - 04/Mar/09 08:11 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12401313/hadoop-4664-v1.patch
against trunk revision 749919.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/44/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/44/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/44/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/44/console

This message is automatically generated.


Jothi Padmanabhan added a comment - 05/Mar/09 03:42 AM
Failed test, TestJobHistory, is unrelated.

Amar Kamat added a comment - 05/Mar/09 04:31 AM
Few comments :
  1. I feel Matei's implementation is simpler and does not involve the overhead of adding an extra thread. Tom, can you plz explain why ThreadPoolExecutor should be used?
  2. In JobInProgress.initTasks(), the first dfs call is made via JobHistory.logSubmitted(). This can also block on a dfs call made in JobHistory.getJobHistoryFileName() thus blocking all the other threads. Hence there is a corner case where all the threads will be blocked (on JobHistory). Here are the apis which are synchronized and might block on a dfs call
    1. JobHistory.getJobHistoryFileName() within JobHistory.logSubmitted()
    2. JobTracker.finalizeJob() and JobHistory.finalizeRecovery() within JobTracker.finalizeJob()
  3. All the api's invoked from JobInProgress.initTasks() should be made thread safe. Example, we should document that JobTracker.resolveAndAddToTopology() should be thread safe. Following are the apis that should be made thread safe
    Class api structure
    JobHistory logSubmitted() / logInited() / logFinished() / logFailed() / logJobPriority() openJobs
    JobTracker resolveAndAddToTopology() dnsToSwitchMapping
    JobEndNotifier registerNotification() queue
    JobTracker storeCompletedJob() completedJobStatusStore(looks at store() etc)

Hey can you plz check if there are other such apis.


In future we might want to associate a timer with each thread. We really dont want 3 out of 4 threads to be blocked for 1hr on dfs operations. But for now I think its a premature step.


Jothi Padmanabhan added a comment - 05/Mar/09 05:22 AM
True. There are dfs calls in JobHistory*. However, I do not foresee a deadlock, these calls just might make it sequential in what otherwise could have been achieved in parallel, no? What this patch does is not insulate from slow data nodes, but just mitigates the effect.

On the different thread-safely items:

  1. JobHistory.openJobs - Agreed. This needs to be made into a concurrentHashMap
  2. dnsToSwitchMapping. The method in question is resolve. From what I can see, this method is already thread safe in all its current implementations.
  3. JobEndNotifier.queue - This is already a delayedqueue.
  4. storeCompledJob() - CompletedJobStatusStore.store seems to be thread safe already.

Jothi Padmanabhan added a comment - 05/Mar/09 07:22 AM
Attaching a patch that changes JobHistory.openJobs to a concurrenthashmap

Jothi Padmanabhan added a comment - 05/Mar/09 10:57 AM

can you plz explain why ThreadPoolExecutor should be used

When used with a smaller number of threads, there is not much of a difference between a ThreadPoolExecutor and explicitly managing threads . However, if and when we scale to using a larger number of threads, having a ThreadPoolExecutor manage is definitely a better option than managing explicitly.


Eric Yang added a comment - 05/Mar/09 09:33 PM
Resubmit patch to hudson, trunk test was broken by HADOOP-5409.

Hadoop QA added a comment - 09/Mar/09 05:21 AM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12401498/hadoop-4664-v2.patch
against trunk revision 751463.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/31/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/31/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/31/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-minerva.apache.org/31/console

This message is automatically generated.


Devaraj Das added a comment - 12/Mar/09 11:33 AM
Some comments:
Set the daemon attribute for the init threads.
The termination of the main init thread should be fixed. The "while(true)" should be checking for the interrupt status.
It would be better to use the static method from the Executors factory - Executors.newFixedThreadPool(int) instead of constructing a new thread pool using the explicit constructor.
Don't have to catch Exception in JobInitManager.run

Jothi Padmanabhan added a comment - 12/Mar/09 12:58 PM
Patch incorporating review comments.

Test Patch result:

[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.


Jothi Padmanabhan added a comment - 12/Mar/09 03:28 PM
Based on an offline discussion with Devaraj, made some minor changes to the patch.

Test-Patch results:

[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[exec]


Devaraj Das added a comment - 12/Mar/09 04:59 PM - edited
I just committed this to the 0.20 branch and the trunk. Thanks Matei and Jothi!

Hadoop QA added a comment - 12/Mar/09 08:13 PM
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12402051/hadoop-4664-v4.patch
against trunk revision 752927.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 3 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs warnings.

+1 Eclipse classpath. The patch retains Eclipse classpath integrity.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/77/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/77/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/77/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/77/console

This message is automatically generated.


Hudson added a comment - 13/Mar/09 03:05 PM
Integrated in Hadoop-trunk #778 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/778/)
. Introduces multiple job initialization threads, where the number of threads are configurable via mapred.jobinit.threads. Contributed by (Matei Zaharia and Jothi Padmanabhan.