Apache Tez / TEZ-2310

Deadlock caused by StateChangeNotifier sending notifications on thread holding locks

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      See the following deadlock in testing:

      Thread #1:

      Daemon Thread [App Shared Pool - #3] (Suspended)	
      	owns: VertexManager$VertexManagerPluginContextImpl  (id=327)	
      	owns: ShuffleVertexManager  (id=328)	
      	owns: VertexManager  (id=329)	
      	waiting for: VertexManager$VertexManagerPluginContextImpl  (id=326)	
      	VertexManager$VertexManagerPluginContextImpl.onStateUpdated(VertexStateUpdate) line: 344	
      	StateChangeNotifier$ListenerContainer.sendStateUpdate(VertexStateUpdate) line: 138	
      	StateChangeNotifier$ListenerContainer.access$100(StateChangeNotifier$ListenerContainer, VertexStateUpdate) line: 122	
      	StateChangeNotifier.sendStateUpdate(TezVertexID, VertexStateUpdate) line: 116	
      	StateChangeNotifier.stateChanged(TezVertexID, VertexStateUpdate) line: 106	
      	VertexImpl.maybeSendConfiguredEvent() line: 3385	
      	VertexImpl.doneReconfiguringVertex() line: 1634	
      	VertexManager$VertexManagerPluginContextImpl.doneReconfiguringVertex() line: 339	
      	ShuffleVertexManager.schedulePendingTasks(int) line: 561	
      	ShuffleVertexManager.schedulePendingTasks() line: 620	
      	ShuffleVertexManager.handleVertexStateUpdate(VertexStateUpdate) line: 731	
      	ShuffleVertexManager.onVertexStateUpdated(VertexStateUpdate) line: 744	
      	VertexManager$VertexManagerEventOnVertexStateUpdate.invoke() line: 527	
      	VertexManager$VertexManagerEvent$1.run() line: 612	
      	VertexManager$VertexManagerEvent$1.run() line: 607	
      	AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]	
      	Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415	
      	UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548	
      	VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 607	
      	VertexManager$VertexManagerEventOnVertexStateUpdate(VertexManager$VertexManagerEvent).call() line: 596	
      	ListenableFutureTask<V>(FutureTask<V>).run() line: 262	
      	ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145	
      	ThreadPoolExecutor$Worker.run() line: 615	
      	Thread.run() line: 745	
      

      Thread #2:

      Daemon Thread [App Shared Pool - #2] (Suspended)	
      	owns: VertexManager$VertexManagerPluginContextImpl  (id=326)	
      	owns: PigGraceShuffleVertexManager  (id=344)	
      	owns: VertexManager  (id=345)	
      	Unsafe.park(boolean, long) line: not available [native method]	
      	LockSupport.park(Object) line: 186	
      	ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).parkAndCheckInterrupt() line: 834	
      	ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).doAcquireShared(int) line: 964	
      	ReentrantReadWriteLock$NonfairSync(AbstractQueuedSynchronizer).acquireShared(int) line: 1282	
      	ReentrantReadWriteLock$ReadLock.lock() line: 731	
      	VertexImpl.getTotalTasks() line: 952	
      	VertexManager$VertexManagerPluginContextImpl.getVertexNumTasks(String) line: 162	
      	PigGraceShuffleVertexManager(ShuffleVertexManager).updateSourceTaskCount() line: 435	
      	PigGraceShuffleVertexManager(ShuffleVertexManager).onVertexStarted(Map<String,List<Integer>>) line: 353	
      	VertexManager$VertexManagerEventOnVertexStarted.invoke() line: 541	
      	VertexManager$VertexManagerEvent$1.run() line: 612	
      	VertexManager$VertexManagerEvent$1.run() line: 607	
      	AccessController.doPrivileged(PrivilegedExceptionAction<T>, AccessControlContext) line: not available [native method]	
      	Subject.doAs(Subject, PrivilegedExceptionAction<T>) line: 415	
      	UserGroupInformation.doAs(PrivilegedExceptionAction<T>) line: 1548	
      	VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 607	
      	VertexManager$VertexManagerEventOnVertexStarted(VertexManager$VertexManagerEvent).call() line: 596	
      	ListenableFutureTask<V>(FutureTask<V>).run() line: 262	
      	ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1145	
      	ThreadPoolExecutor$Worker.run() line: 615	
      	Thread.run() line: 745	
      

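      What happens is that thread #1 holds a writeLock (VertexImpl:1628) and, while still holding it, tries to enter a synchronized block: the trace shows it inside ShuffleVertexManager.onVertexStateUpdated, blocked in VertexManagerPluginContextImpl.onStateUpdated waiting for the monitor with id=326. Meanwhile, thread #2 already owns that monitor, is inside a synchronized block (ShuffleVertexManager.onVertexStarted), and is trying to acquire a readLock (VertexImpl:952), which it cannot get while thread #1 holds the writeLock. Neither thread can proceed. Holding a lock and then entering a synchronized block on another object is dangerous.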
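To make the cycle concrete, here is a minimal standalone sketch of the same lock-order inversion. It is not Tez code: `LockInversionSketch`, `vertexLock`, and `contextMonitor` are hypothetical stand-ins for the VertexImpl read/write lock and the synchronized plugin context. The reader uses a timed `tryLock` instead of `lock()` so the demo terminates rather than hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-ins for the objects in the traces; this is not Tez code.
class LockInversionSketch {
    // Plays the role of the VertexImpl read/write lock.
    private final ReentrantReadWriteLock vertexLock = new ReentrantReadWriteLock();
    // Plays the role of the monitor guarding the synchronized plugin context.
    private final Object contextMonitor = new Object();

    /** Returns true when the reader timed out, i.e. a plain lock() would have hung. */
    boolean reproduce() {
        CountDownLatch writeLockHeld = new CountDownLatch(1);
        CountDownLatch monitorHeld = new CountDownLatch(1);
        AtomicBoolean readerTimedOut = new AtomicBoolean(false);

        // Like thread #1: takes the write lock, then tries to enter the monitor.
        Thread t1 = new Thread(() -> {
            vertexLock.writeLock().lock();
            try {
                writeLockHeld.countDown();
                awaitQuietly(monitorHeld);
                synchronized (contextMonitor) {
                    // Reached only after the other thread gives up and leaves the monitor.
                }
            } finally {
                vertexLock.writeLock().unlock();
            }
        });

        // Like thread #2: owns the monitor, then asks for the read lock.
        Thread t2 = new Thread(() -> {
            synchronized (contextMonitor) {
                monitorHeld.countDown();
                awaitQuietly(writeLockHeld);
                try {
                    // Timed attempt so the sketch ends; lock() here would block forever.
                    if (vertexLock.readLock().tryLock(200, TimeUnit.MILLISECONDS)) {
                        vertexLock.readLock().unlock();
                    } else {
                        readerTimedOut.set(true);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return readerTimedOut.get();
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("reader would have deadlocked: " + new LockInversionSketch().reproduce());
    }
}
```

The two latches force the interleaving seen in the traces: the writer cannot release its lock until it gets the monitor, and the monitor owner cannot leave until it gets the read lock, so the timed attempt always expires.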

      I am attaching a patch that avoids this pattern, and with it the deadlock goes away. I am not sure whether this is the right fix, or whether there are other places with the same pattern.
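One general way to avoid calling listeners on a thread that holds locks, as the issue title suggests, is to hand the notification to a separate dispatcher thread. The sketch below illustrates that idea only; it is an assumption and is not taken from any of the attached patches. `AsyncNotifierSketch`, `Listener`, `reconfigure`, and the `"CONFIGURED"` string are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration; not the StateChangeNotifier from Tez.
class AsyncNotifierSketch {
    interface Listener {
        void onStateUpdated(AsyncNotifierSketch vertex, String update);
    }

    private final ReentrantReadWriteLock vertexLock = new ReentrantReadWriteLock();
    // Single dispatcher thread: preserves notification order, holds no vertex locks.
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();
    private final Listener listener;
    private int totalTasks;

    AsyncNotifierSketch(Listener listener) {
        this.listener = listener;
    }

    void reconfigure(int newTotalTasks) {
        vertexLock.writeLock().lock();
        try {
            totalTasks = newTotalTasks;
            // Only enqueue here; the callback runs later on the dispatcher thread,
            // so the listener's monitor is never requested while the write lock is held.
            dispatcher.execute(() -> listener.onStateUpdated(this, "CONFIGURED"));
        } finally {
            vertexLock.writeLock().unlock();
        }
    }

    int getTotalTasks() {
        vertexLock.readLock().lock();
        try {
            return totalTasks;
        } finally {
            vertexLock.readLock().unlock();
        }
    }

    /** Stops the dispatcher and waits for queued notifications to be delivered. */
    boolean shutdown() {
        dispatcher.shutdown();
        try {
            return dispatcher.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** A synchronized listener that reads back through the read lock. */
    static int demo() {
        AtomicInteger seen = new AtomicInteger(-1);
        Object contextMonitor = new Object();
        AsyncNotifierSketch vertex = new AsyncNotifierSketch((v, update) -> {
            synchronized (contextMonitor) {
                seen.set(v.getTotalTasks());
            }
        });
        vertex.reconfigure(4);
        vertex.shutdown();
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println("listener saw totalTasks = " + demo());
    }
}
```

The trade-off in this design is that listeners observe state changes slightly later than they happen, so a listener must read current state through the locked accessors rather than assume nothing changed since the notification was queued.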

      Attachments

        1. TEZ-2310-0.patch
          0.6 kB
          Daniel Dai
        2. TEZ-2310.1.patch
          15 kB
          Bikas Saha
        3. TEZ-2310.2.patch
          17 kB
          Bikas Saha
        4. TEZ-2310.3.patch
          20 kB
          Bikas Saha

            People

              Assignee: Bikas Saha (bikassaha)
              Reporter: Daniel Dai (daijy)