Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Reviewed
Description
The following deadlock was seen in testing:
Thread #1:
Daemon Thread [App Shared Pool - #3] (Suspended)
    owns: VertexManager$VertexManagerPluginContextImpl (id=327)
    owns: ShuffleVertexManager (id=328)
    owns: VertexManager (id=329)
    waiting for: VertexManager$VertexManagerPluginContextImpl (id=326)
    VertexManager$VertexManagerPluginContextImpl.onStateUpdated(VertexStateUpdate) line: 344
    StateChangeNotifier$ListenerContainer.sendStateUpdate(VertexStateUpdate) line: 138
    StateChangeNotifier$ListenerContainer.access$100(StateChangeNotifier$ListenerContainer, VertexStateUpdate) line: 122
    StateChangeNotifier.sendStateUpdate(TezVertexID, VertexStateUpdate) line: 116
    StateChangeNotifier.stateChanged(TezVertexID, VertexStateUpdate) line: 106
    VertexImpl.maybeSendConfiguredEvent() line: 3385
    VertexImpl.doneReconfiguringVertex() line: 1634
    VertexManager$VertexManagerPluginContextImpl.doneReconfiguringVertex() line: 339
    ShuffleVertexManager.schedulePendingTasks(int) line: 561
    ShuffleVertexManager.schedulePendingTasks() line: 620
    ShuffleVertexManager.handleVertexStateUpdate(VertexStateUpdate) line: 731
    ShuffleVertexManager.onVertexStateUpdated(VertexStateUpdate) line: 744
    VertexManager$VertexManagerEventOnVertexStateUpdate.invoke() line: 527
    VertexManager$VertexManagerEvent$1.run() line: 612
    VertexManager$VertexManagerEvent$1.run() line: 607
    AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
    Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
    UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
    VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 607
    VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 596
    ListenableFutureTask<V>(FutureTask<V>).run() line: 262
    ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
    ThreadPoolExecutor$Worker.run() line: 615
    Thread.run() line: 745
Thread #2:
Daemon Thread [App Shared Pool - #2] (Suspended)
    owns: VertexManager$VertexManagerPluginContextImpl (id=326)
    owns: PigGraceShuffleVertexManager (id=344)
    owns: VertexManager (id=345)
    Unsafe.park(boolean, long) line: not available [native method]
    LockSupport.park(Object) line: 186
    ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).parkAndCheckInterrupt() line: 834
    ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).doAcquireShared(int) line: 964
    ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).acquireShared(int) line: 1282
    ReentrantReadWriteLock$ReadLock.lock() line: 731
    VertexImpl.getTotalTasks() line: 952
    VertexManager$VertexManagerPluginContextImpl.getVertexNumTasks(String) line: 162
    PigGraceShuffleVertexManager(ShuffleVertexManager).updateSourceTaskCount() line: 435
    PigGraceShuffleVertexManager(ShuffleVertexManager).onVertexStarted(Map<String,List<Integer>>) line: 353
    VertexManager$VertexManagerEventOnVertexStarted.invoke() line: 541
    VertexManager$VertexManagerEvent$1.run() line: 612
    VertexManager$VertexManagerEvent$1.run() line: 607
    AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]
    Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415
    UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548
    VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 607
    VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 596
    ListenableFutureTask<V>(FutureTask<V>).run() line: 262
    ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145
    ThreadPoolExecutor$Worker.run() line: 615
    Thread.run() line: 745
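What happens is that thread #1, running inside ShuffleVertexManager.onVertexStateUpdated, holds a writeLock (VertexImpl:1628) and then tries to enter a synchronized block (VertexManager$VertexManagerPluginContextImpl.onStateUpdated, on the object with id=326). In the meantime, thread #2 already owns that monitor (it is inside the synchronized ShuffleVertexManager.onVertexStarted path) and tries to get a readLock (VertexImpl:952). Each thread is waiting for a lock the other holds. Holding a lock and then entering a synchronized block can be dangerous.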
I have attached a patch that avoids this pattern, and with it the deadlock goes away. I am not sure whether that is the right fix, or whether there are other patterns like this elsewhere.
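To illustrate the pattern, here is a minimal sketch of one way to avoid it. This is not the attached patch and does not use Tez code: the classes Vertex and Manager, and all method bodies, are hypothetical stand-ins. The idea is to update state under the write lock, release it, and only then call into the synchronized listener, so no thread ever waits for a monitor while holding the write lock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch only; names loosely mirror the traces but none of this is Tez code.
class LockOrderSketch {

    /** Stand-in for a vertex whose state is guarded by a read/write lock. */
    static class Vertex {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private int totalTasks = 1;
        private Manager listener;

        void setListener(Manager m) { listener = m; }

        int getTotalTasks() {
            lock.readLock().lock();
            try {
                return totalTasks;
            } finally {
                lock.readLock().unlock();
            }
        }

        void doneReconfiguring(int newTasks) {
            lock.writeLock().lock();
            try {
                totalTasks = newTasks;
            } finally {
                lock.writeLock().unlock();
            }
            // The listener is synchronized, so it is called only after the
            // write lock has been released. Calling it inside the try block
            // above would recreate the lock-then-monitor ordering.
            if (listener != null) {
                listener.onStateUpdated(newTasks);
            }
        }
    }

    /** Stand-in for a vertex manager whose callbacks are synchronized. */
    static class Manager {
        private final Vertex source;
        private int lastSeen;

        Manager(Vertex source) { this.source = source; }

        synchronized void onStateUpdated(int tasks) { lastSeen = tasks; }

        // Takes the monitor first, then the vertex read lock (like thread #2).
        synchronized void onVertexStarted() { lastSeen = source.getTotalTasks(); }
    }

    /** Runs both code paths concurrently; returns true if both threads finish. */
    static boolean runDemo() {
        Vertex v = new Vertex();
        Manager m = new Manager(v);
        v.setListener(m);
        Thread t1 = new Thread(() -> {
            for (int i = 1; i <= 1000; i++) v.doneReconfiguring(i);
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < 1000; i++) m.onVertexStarted();
        });
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            t1.join(5000);
            t2.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !t1.isAlive() && !t2.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(runDemo() ? "both threads finished" : "threads still blocked");
    }
}
```

The trade-off in this sketch is that the listener may observe the update slightly after other readers do, since the notification is no longer atomic with the state change. Whether that is acceptable for the real VertexImpl is a separate question.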