[HADOOP-4977] Deadlock between reclaimCapacity and assignTasks - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.19.0
Fix Version/s: 0.20.0
Component/s: None
Labels: None
Hadoop Flags: Reviewed

Description

I was running the latest trunk with the capacity scheduler and saw the JobTracker lock up with the following deadlock reported in jstack:

Found one Java-level deadlock:
=============================
"18107298@qtp0-4":
waiting to lock monitor 0x08085b40 (object 0x56605100, a org.apache.hadoop.mapred.JobTracker),
which is held by "IPC Server handler 4 on 54311"
"IPC Server handler 4 on 54311":
waiting to lock monitor 0x0808594c (object 0x5660e518, a org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr),
which is held by "reclaimCapacity"
"reclaimCapacity":
waiting to lock monitor 0x08085b40 (object 0x56605100, a org.apache.hadoop.mapred.JobTracker),
which is held by "IPC Server handler 4 on 54311"

Java stack information for the threads listed above:
===================================================
"18107298@qtp0-4":
at org.apache.hadoop.mapred.JobTracker.getClusterStatus(JobTracker.java:2695)
waiting to lock <0x56605100> (a org.apache.hadoop.mapred.JobTracker)
at org.apache.hadoop.mapred.jobtracker_jsp._jspService(jobtracker_jsp.java:93)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
"IPC Server handler 4 on 54311":
at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.updateQSIObjects(CapacityTaskScheduler.java:564)
waiting to lock <0x5660e518> (a org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr)
at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:855)
at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$1000(CapacityTaskScheduler.java:294)
at org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1336)
locked <0x5660dd20> (a org.apache.hadoop.mapred.CapacityTaskScheduler)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2288)
locked <0x56605100> (a org.apache.hadoop.mapred.JobTracker)
at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)

Unfortunately, I failed to select all of the output, so some is missing. It appears that reclaimCapacity locks the MapSchedulingMgr and then tries to lock the JobTracker, whereas updateQSIObjects, called from assignTasks, runs with the JobTracker lock already held (the JobTracker takes this lock before it calls assignTasks) and then tries to lock the MapSchedulingMgr. The other thread listed is a Jetty thread for the web interface and isn't part of the circular wait. One solution would be to lock the JobTracker in reclaimCapacity before locking anything else.
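
The lock-ordering idea proposed above can be illustrated with a minimal Java sketch. It is not taken from the Hadoop source: the class name `LockOrderSketch`, the two plain `Object` monitors, and the counters are hypothetical stand-ins for the JobTracker and MapSchedulingMgr monitors seen in the jstack output. Both code paths take the monitors in the same order (JobTracker first), so a circular wait between them cannot form.

```java
public class LockOrderSketch {
    // Hypothetical stand-in for the JobTracker monitor.
    static final Object jobTracker = new Object();
    // Hypothetical stand-in for the MapSchedulingMgr monitor.
    static final Object schedulingMgr = new Object();

    static int heartbeats = 0;
    static int reclaims = 0;

    // Heartbeat path: JobTracker first, then the scheduling manager.
    static void heartbeat() {
        synchronized (jobTracker) {
            synchronized (schedulingMgr) {
                heartbeats++;
            }
        }
    }

    // Reclaim path with the proposed ordering: JobTracker first here too.
    // Taking schedulingMgr first (as in the reported trace) would allow
    // each thread to hold one monitor while waiting for the other.
    static void reclaimCapacity() {
        synchronized (jobTracker) {
            synchronized (schedulingMgr) {
                reclaims++;
            }
        }
    }

    // Runs both paths concurrently; returns true if both threads finish.
    static boolean runBoth(int iterations) {
        Thread a = new Thread(() -> {
            for (int i = 0; i < iterations; i++) heartbeat();
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < iterations; i++) reclaimCapacity();
        });
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();
        try {
            a.join(10_000);
            b.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !a.isAlive() && !b.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both threads finished: " + runBoth(100_000));
    }
}
```

A consistent order removes the cycle, but it also makes the reclaim thread contend for the JobTracker lock on every pass, which is one reason a different fix might be preferred.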

Attachments


jstack.txt
02/Jan/09 23:23
5 kB
Matei Alexandru Zaharia
4977.1.patch
09/Jan/09 08:39
19 kB
Vivek Ratan
4977.2.patch
09/Jan/09 10:53
20 kB
Vivek Ratan
4977.3.patch
12/Jan/09 13:25
20 kB
Vivek Ratan
4977.4.patch
14/Jan/09 03:38
22 kB
Vivek Ratan
4977.4.patch
15/Jan/09 01:49
21 kB
Hemanth Yamijala

Activity


Matei Alexandru Zaharia added a comment - 02/Jan/09 23:23

I managed to reproduce this, here's the full jstack output.

Arun Murthy added a comment - 05/Jan/09 01:43

Marking this as a blocker...

Vinod Kumar Vavilapalli added a comment - 06/Jan/09 09:57

Found this while running tests for ~~HADOOP-4830~~. Thought it might help.

    [junit] Found one Java-level deadlock:
    [junit] =============================
    [junit] "IPC Server handler 9 on 51089":
    [junit]   waiting to lock monitor 0x08120f38 (object 0xe5346a40, a org.apache.hadoop.mapred.JobTracker),
    [junit]   which is held by "IPC Server handler 7 on 51089"
    [junit] "IPC Server handler 7 on 51089":
    [junit]   waiting to lock monitor 0x08120ce0 (object 0xe5346d28, a org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr),
    [junit]   which is held by "reclaimCapacity"
    [junit] "reclaimCapacity":
    [junit]   waiting to lock monitor 0x08120f38 (object 0xe5346a40, a org.apache.hadoop.mapred.JobTracker),
    [junit]   which is held by "IPC Server handler 7 on 51089"
    [junit] 
    [junit] Java stack information for the threads listed above:
    [junit] ===================================================
    [junit] "IPC Server handler 9 on 51089":
    [junit]     at org.apache.hadoop.mapred.JobTracker.getJobStatus(JobTracker.java:2783)
    [junit]     - waiting to lock <0xe5346a40> (a org.apache.hadoop.mapred.JobTracker)
    [junit]     at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    [junit]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    [junit]     at java.lang.reflect.Method.invoke(Method.java:597)
    [junit]     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    [junit]     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
    [junit]     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
    [junit]     at java.security.AccessController.doPrivileged(Native Method)
    [junit]     at javax.security.auth.Subject.doAs(Subject.java:396)
    [junit]     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
    [junit] "IPC Server handler 7 on 51089":
    [junit]     at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.updateQSIObjects(CapacityTaskScheduler.java:564)
    [junit]     - waiting to lock <0xe5346d28> (a org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr)
    [junit]     at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:855)
    [junit]     at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$1000(CapacityTaskScheduler.java:294)
    [junit]     at org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1336)
    [junit]     - locked <0xe5346cd8> (a org.apache.hadoop.mapred.CapacityTaskScheduler)
    [junit]     at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2288)
    [junit]     - locked <0xe5346a40> (a org.apache.hadoop.mapred.JobTracker)
    [junit]     at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    [junit]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    [junit]     at java.lang.reflect.Method.invoke(Method.java:597)
    [junit]     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    [junit]     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
    [junit]     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
    [junit]     at java.security.AccessController.doPrivileged(Native Method)
    [junit]     at javax.security.auth.Subject.doAs(Subject.java:396)
    [junit]     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
    [junit] "reclaimCapacity":
    [junit]     at org.apache.hadoop.mapred.JobTracker.getClusterStatus(JobTracker.java:2695)
    [junit]     - waiting to lock <0xe5346a40> (a org.apache.hadoop.mapred.JobTracker)
    [junit]     at org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.getClusterCapacity(CapacityTaskScheduler.java:939)
    [junit]     at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.updateQSIObjects(CapacityTaskScheduler.java:564)
    [junit]     - locked <0xe5346d28> (a org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr)
    [junit]     at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.reclaimCapacity(CapacityTaskScheduler.java:405)
    [junit]     - locked <0xe5346d28> (a org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr)
    [junit]     at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$700(CapacityTaskScheduler.java:294)
    [junit]     at org.apache.hadoop.mapred.CapacityTaskScheduler.reclaimCapacity(CapacityTaskScheduler.java:1278)
    [junit]     at org.apache.hadoop.mapred.CapacityTaskScheduler$ReclaimCapacity.run(CapacityTaskScheduler.java:1084)
    [junit]     at java.lang.Thread.run(Thread.java:619)
    [junit] 
    [junit] Found 1 deadlock.

Amar Kamat added a comment - 07/Jan/09 07:50

Identified 2 scenarios where the deadlock can happen:
1) The ReclaimCapacity thread calls TaskSchedulingMgr.reclaimCapacity(), which internally calls updateQSIObjects(), which in turn calls TaskTrackerManager.getClusterStatus(), a synchronized call.
2) The ReclaimCapacity thread calls TaskSchedulingMgr.reclaimCapacity(), which internally calls TaskTrackerManager.getNextHeartbeatInterval(), which in turn calls TaskTrackerManager.getClusterStatus(), a synchronized call.

Note that this can happen whenever a capacity scheduler thread calls back into TaskTrackerManager and tries to lock it, while the TaskTrackerManager itself invokes the TaskScheduler's API after locking itself. The whole deadlock issue can be summarized as follows:

Who	Via	Locks?	Needs to lock?	Via
JobTracker	JobTracker.heartbeat()	JobTracker (itself)	TaskSchedulingMgr	CapacityTaskScheduler.assignTasks() calls TaskSchedulingMgr.assignTasks(), which calls updateQSIObjects(), a synchronized call
CapacityScheduler.ReclaimCapacityThread	TaskSchedulingMgr.reclaimCapacity()	TaskSchedulingMgr	JobTracker	TaskSchedulingMgr.reclaimCapacity() calls TaskSchedulingMgr.updateQSIObjects(), which calls JobTracker.getClusterStatus(), a synchronized call; TaskSchedulingMgr.reclaimCapacity() also calls JobTracker.getNextHeartbeatInterval(), a synchronized call

Spoke to Hemanth and Vivek on this and we all agree that various cluster parameters (cluster-status, heartbeat-interval) should be obtained/cached before invoking reclaimCapacity() and this should be a common rule while adding new threads.

Hemanth Yamijala added a comment - 08/Jan/09 07:13

I think we should get this addressed very soon, now that ~~HADOOP-4980~~ is committed. Also, because I committed ~~HADOOP-4830~~, folks might hit this often while running tests. One Hudson build failed due to this, as reported on the core-dev mailing list.

Vivek Ratan added a comment - 09/Jan/09 08:39

Attaching patch (4977.1.patch). As Amar points out, the problem is that the JT locks itself, then calls the scheduler's assignTasks, which tries to get a lock on one of the scheduler's objects. In the meantime, a separate thread in the scheduler locks this object, then calls a TaskTrackerManager method. TaskTrackerManager is implemented by the JT. Hence the deadlock.

The fix is for threads in the scheduler to call TaskTrackerManager first, before locking anything in the Scheduler.

I've made the following changes:

I've moved updateQSIObjects() from TaskSchedulingMgr to CapacityScheduler. We may as well update both the map and reduce tasks in one go, rather than do them separately and walk the list of jobs in a queue twice.
updateQSI was called in three places: assignTasks (when processing a heartbeat), the reclaimCapacity thread, and in test cases. In all these calls, we get the cluster information from TaskTrackerManager first, then update the QSI objects.
I renamed one of the methods from updateQSIInfo to updateQSIInfoForTests to better suggest what it does. Hence the minor changes in TestCapacityScheduler.java.
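
The approach described in this comment (ask TaskTrackerManager for cluster information first, then lock scheduler state) could look roughly like the sketch below. This is not the code in 4977.1.patch: `SnapshotBeforeLockSketch`, `ClusterSnapshot`, `Tracker`, `Scheduler`, and `updateQueueInfo` are hypothetical, simplified names. The point it tries to show is that the reclaim path never holds the scheduler lock while waiting for the tracker lock, so the two locks are never nested in that direction.

```java
public class SnapshotBeforeLockSketch {

    /** Immutable copy of the cluster figures the scheduler needs. */
    public static final class ClusterSnapshot {
        public final int mapSlots;
        public final int reduceSlots;

        public ClusterSnapshot(int mapSlots, int reduceSlots) {
            this.mapSlots = mapSlots;
            this.reduceSlots = reduceSlots;
        }
    }

    /** Hypothetical stand-in for the TaskTrackerManager implemented by the JT. */
    public static class Tracker {
        private final int mapSlots;
        private final int reduceSlots;

        public Tracker(int mapSlots, int reduceSlots) {
            this.mapSlots = mapSlots;
            this.reduceSlots = reduceSlots;
        }

        // Synchronized, like getClusterStatus() in the stack traces.
        public synchronized ClusterSnapshot getClusterStatus() {
            return new ClusterSnapshot(mapSlots, reduceSlots);
        }
    }

    /** Hypothetical stand-in for the scheduler's per-queue bookkeeping. */
    public static class Scheduler {
        private int mapCapacity;
        private int reduceCapacity;

        // Takes only the scheduler lock and never calls back into the tracker.
        // Map and reduce figures are updated in a single pass.
        public synchronized void updateQueueInfo(ClusterSnapshot snapshot) {
            mapCapacity = snapshot.mapSlots;
            reduceCapacity = snapshot.reduceSlots;
        }

        public synchronized int getMapCapacity() { return mapCapacity; }

        public synchronized int getReduceCapacity() { return reduceCapacity; }
    }

    // Reclaim path: fetch the snapshot with no scheduler lock held,
    // then update scheduler state under the scheduler lock alone.
    public static void reclaimCapacity(Tracker tracker, Scheduler scheduler) {
        ClusterSnapshot snapshot = tracker.getClusterStatus();
        scheduler.updateQueueInfo(snapshot);
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker(40, 20);
        Scheduler scheduler = new Scheduler();
        reclaimCapacity(tracker, scheduler);
        System.out.println(scheduler.getMapCapacity() + " map slots, "
                + scheduler.getReduceCapacity() + " reduce slots");
    }
}
```

The trade-off in this sketch is that the snapshot may be slightly stale by the time the scheduler lock is taken; whether that matters depends on how the figures are used.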

Vivek Ratan added a comment - 09/Jan/09 10:53

Attaching new patch (4977.2.patch). Amar pointed out one more instance where reclaimCapacity() was calling TaskTrackerManager - to get the heartbeat interval. That has been fixed. reclaimCapacity() no longer calls any method in TaskTrackerManager.

Amar Kamat added a comment - 12/Jan/09 11:38

Few comments:

Having both updateQSIObjects() and updateQSIObjects(...) might make it confusing as to which API to use. There seems to be no documentation to clarify which to call when. So I feel we can have only one API named updateQSIObjects(), which first takes a snapshot of the required parameters from TaskTrackerManager and then synchronizes on the remaining main code.
There is an extra line that says "for debugging".

Vivek Ratan added a comment - 12/Jan/09 13:25

Valid points, Amar. I did some extra refactoring at the end and didn't realize the confusing calls. I do want to keep the update of the QSI objects separate from the fetching of the cluster status, as both these can come under different synchronization calls, if required. So each of the three callers: assignTasks(), reclaimCapacity, and the test cases, get the cluster stats on their own and then call updateQSIObjects(). See attached patch (4977.3.patch).

The method printQSIs() is useful for printing debug information. We don't call it right now (we did, earlier), but I wanted to keep it around as it can prove quite useful.

Amar Kamat added a comment - 13/Jan/09 06:22

Looks good. +1. We need to manually test this patch for deadlocks.
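
As a possible aid to the manual testing mentioned here, the JVM can be asked for deadlocked threads programmatically, which gives roughly the same information as the "Found one Java-level deadlock" section of jstack. The sketch below uses the standard `java.lang.management.ThreadMXBean` API; the class name `DeadlockCheckSketch` and its helper methods are hypothetical and are not part of the patch or of Hadoop. `findDeadlockedThreads()` can throw `UnsupportedOperationException` on JVMs that do not support monitoring of ownable synchronizers.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheckSketch {

    // Number of threads currently deadlocked on monitors or ownable
    // synchronizers; 0 when the JVM reports none.
    static int countDeadlockedThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findDeadlockedThreads(); // null when there is no deadlock
        return ids == null ? 0 : ids.length;
    }

    // One line per deadlocked thread: who waits for which lock, held by whom.
    static String describeDeadlocks() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long[] ids = bean.findDeadlockedThreads();
        if (ids == null) {
            return "no deadlock found";
        }
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : bean.getThreadInfo(ids)) {
            if (info == null) {
                continue; // thread ended between the two calls
            }
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" waits for ").append(info.getLockName())
              .append(" held by \"").append(info.getLockOwnerName()).append("\"\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describeDeadlocks());
    }
}
```

A check like this could be polled from a test while heartbeats and the reclaimCapacity thread run concurrently, so that a regression fails fast instead of hanging the build.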

Vivek Ratan added a comment - 14/Jan/09 03:38

Attaching another patch (4977.4.patch) with some minor cleanup - I removed a couple of methods that were no longer needed.

Hemanth Yamijala added a comment - 15/Jan/09 01:51

Attached a patch on which I've run dos2unix. Vivek, in future, can you please make sure the patch files do not have Windows EOL characters? Maybe some setting in the editor can change this.

Also, in future, please attach test-patch results when you are uploading a patch. Here are the results for this one:

[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

Hemanth Yamijala added a comment - 15/Jan/09 02:58

I committed this patch to trunk and Hadoop 0.20. Thanks, Vivek !

People

Assignee: Vivek Ratan

Reporter: Matei Alexandru Zaharia

Votes: 0

Watchers: 4

Dates

Created: 02/Jan/09 23:18

Updated: 08/Jul/09 16:40

Resolved: 15/Jan/09 02:58
