Details
Bug
Status: Closed
Blocker
Resolution: Fixed
0.12.3
None
None
Description
Thanks to Koji for the attached stack-trace...
Summary:
main()
  -> offerService()
    -> markUnresponsiveTasks (locks the TaskTracker here)
      -> purgeTask()
        -> removeTaskFromJob (waiting to lock the RunningJob object)
taskCleanup
  -> purgeJob (locks the RunningJob object)
    -> TIP.jobHasFinished()
      -> TIP.cleanup (waiting to lock the TaskTracker)
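The two call chains above take the same pair of locks in opposite orders. Here is a minimal sketch of that pattern, not the actual TaskTracker code: `trackerLock` and `jobLock` are hypothetical stand-ins for the TaskTracker and RunningJob monitors, and `tryLock` with a timeout is used (instead of `synchronized`) only so the demo can report the cycle rather than hang.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderCycleDemo {
    // Hypothetical stand-ins for the TaskTracker and RunningJob monitors.
    static final ReentrantLock trackerLock = new ReentrantLock();
    static final ReentrantLock jobLock = new ReentrantLock();

    /** Takes `first`, waits until the peer holds its own first lock, then tries `second`. */
    static boolean acquireBoth(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHoldFirst, CountDownLatch bothTried)
            throws InterruptedException {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();          // both threads now hold their first lock
            boolean gotSecond = second.tryLock(100, TimeUnit.MILLISECONDS);
            if (gotSecond) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await();              // keep `first` held until the peer has tried too
            return gotSecond;
        } finally {
            first.unlock();
        }
    }

    static Thread path(ReentrantLock first, ReentrantLock second,
                       CountDownLatch a, CountDownLatch b, boolean[] got, int slot) {
        return new Thread(() -> {
            try {
                got[slot] = acquireBoth(first, second, a, b);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    /** Returns true if each thread was blocked on the lock held by the other. */
    static boolean cycleObserved() throws InterruptedException {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] got = new boolean[2];
        // "main" path: tracker first, then job. "cleanup" path: job first, then tracker.
        Thread mainPath = path(trackerLock, jobLock, bothHoldFirst, bothTried, got, 0);
        Thread cleanupPath = path(jobLock, trackerLock, bothHoldFirst, bothTried, got, 1);
        mainPath.start();
        cleanupPath.start();
        mainPath.join();
        cleanupPath.join();
        return !got[0] && !got[1];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("cycle observed: " + cycleObserved());
    }
}
```

With plain `synchronized` blocks there is no timeout, so the same interleaving would leave both threads blocked forever.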
-
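This is a clear case of inconsistent lock ordering during synchronization: the main thread locks the TaskTracker and then waits for the RunningJob, while the cleanup thread locks the RunningJob and then waits for the TaskTracker. It's a corner case, since it depends on the child VM becoming unresponsive just as the cleanup thread kicks in, which is why I'm marking this for 0.14.0 rather than 0.13.0. What do others think?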
-
Two possible solutions to break the deadlock cycle:
a) Make TaskTracker.purgeJob a synchronized method, so that it locks the TaskTracker before locking the RunningJob object.
b) Make the TaskTracker.tasks map a Collections.synchronizedMap, which does away with the need to lock the TaskTracker in TIP.cleanup.
I'd prefer a): TaskTracker.tasks is already referenced in multiple places inside synchronized methods, so a) is the less intrusive change.
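To illustrate option a), here is a rough sketch under simplifying assumptions. `TrackerSketch` and its nested `RunningJob` are hypothetical stand-ins, not the real Hadoop classes, and the method bodies are reduced to the locking pattern. Because `purgeJob` is synchronized, both paths take the tracker monitor before the job monitor, and `cleanup` simply re-enters the tracker monitor that `purgeJob` already holds.

```java
import java.util.HashSet;
import java.util.Set;

public class TrackerSketch {
    /** Hypothetical stand-in for RunningJob; only its monitor and task set matter here. */
    static class RunningJob {
        final Set<String> tasks = new HashSet<>();
    }

    private final Set<String> tasks = new HashSet<>();   // guarded by `this`
    private final RunningJob job = new RunningJob();

    synchronized void addTask(String id) {
        tasks.add(id);
        synchronized (job) {
            job.tasks.add(id);
        }
    }

    // Main-thread path: tracker monitor first, then job monitor.
    synchronized void markUnresponsive(String id) {
        tasks.remove(id);
        synchronized (job) {
            job.tasks.remove(id);
        }
    }

    // Cleanup path under option a): now also tracker monitor first, then job monitor.
    synchronized void purgeJob() {
        synchronized (job) {
            for (String id : job.tasks) {
                cleanup(id);
            }
            job.tasks.clear();
        }
    }

    // Re-enters the tracker monitor already held by purgeJob (Java monitors are reentrant).
    synchronized void cleanup(String id) {
        tasks.remove(id);
    }

    synchronized int taskCount() {
        return tasks.size();
    }

    /** Runs both paths concurrently and returns the number of tasks left over. */
    static int runBothPaths(int rounds) throws InterruptedException {
        TrackerSketch tracker = new TrackerSketch();
        Thread mainPath = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                tracker.addTask("m" + i);
                tracker.markUnresponsive("m" + i);
            }
        });
        Thread cleanupPath = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                tracker.addTask("c" + i);
                tracker.purgeJob();
            }
        });
        mainPath.start();
        cleanupPath.start();
        mainPath.join();
        cleanupPath.join();
        return tracker.taskCount();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("tasks left: " + runBothPaths(1000));
    }
}
```

The trade-off is that the cleanup thread now holds the tracker monitor for the whole purge, which is coarser than option b) but leaves the existing synchronized methods that touch the task map unchanged.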
Thoughts?