Description
How this bug was found:
I was executing a SQL query with Hive on Tez on a cluster with low disk capacity. An exception was thrown during execution (which is quite reasonable), yet the task did not stop normally; it kept hanging for a very long time. I therefore captured a jstack thread dump and investigated. Here is what I found.
(The .jstack file and the screenshot of jstack segment are attached below.)
How this deadlock is triggered:
- A failure to copy files on the local disk triggers copyFailed() from FetcherOrderedGrouped.copyFromHost(), which is a synchronized method on the ShuffleScheduler instance.
- The call from step 1 eventually reaches ShuffleScheduler.close(), which tries to stop the Referee thread by calling referee.interrupt() and referee.join().
- Meanwhile, the Referee thread is waiting in its run() method for the ShuffleScheduler instance lock, which is held by the thread from step 1. So referee.join() never returns, and a deadlock occurs.
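The three steps above can be sketched as a minimal, self-contained Java program. The class and method names below (SchedulerSketch, close(joinMillis)) are hypothetical stand-ins, not Tez code, and close() here uses a join timeout only so the demo terminates; the real ShuffleScheduler.close() calls referee.join() with no timeout, which is exactly why it hangs forever:

```java
// Minimal sketch of the lock-ordering bug (hypothetical class; not Tez code).
// close() is synchronized on the scheduler and join()s the referee thread,
// while the referee's run() must enter the same monitor to make progress.
class SchedulerSketch {
    private final Thread referee = new Thread(this::refereeLoop, "Referee");

    // Stand-in for Referee.run(): it needs the scheduler's monitor,
    // like the real referee synchronizing on the ShuffleScheduler instance.
    private void refereeLoop() {
        synchronized (this) {
            // penalty-box bookkeeping would happen here
        }
    }

    // Stand-in for ShuffleScheduler.close(); synchronized, like copyFailed().
    // join(joinMillis) is used only so this demo terminates; the real code's
    // plain join() blocks forever.
    synchronized boolean close(long joinMillis) throws InterruptedException {
        referee.start();
        // Wait until the referee is parked on this object's monitor.
        while (referee.getState() != Thread.State.BLOCKED) {
            Thread.onSpinWait();
        }
        referee.interrupt();      // no effect: a BLOCKED thread ignores interrupt
        referee.join(joinMillis); // real code: referee.join() -> hangs forever
        return referee.isAlive(); // true => the referee never acquired the lock
    }
}

public class DeadlockDemo {
    public static void main(String[] args) throws Exception {
        boolean stuck = new SchedulerSketch().close(200);
        System.out.println(stuck ? "deadlocked (join timed out)" : "clean exit");
    }
}
```

Note that referee.interrupt() cannot help here: interruption only wakes threads that are WAITING or TIMED_WAITING (e.g. in sleep() or wait()); a thread BLOCKED on monitor entry merely has its interrupt flag set and stays blocked, which matches the BLOCKED state visible in the attached jstack.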
Attachments
Issue Links
- relates to TEZ-4334 Fix deadlock in ShuffleScheduler between ShuffleScheduler.close() and the ShufflePenaltyReferee thread (Resolved)