Description
How this bug was found:
I was executing a SQL query with Hive on Tez on a cluster with low disk capacity. An exception was thrown during execution (which is quite reasonable), yet the task did not stop normally; it kept hanging for a very long time. I therefore captured a jstack thread dump and investigated. Here is what I found.
(The .jstack file and the screenshot of jstack segment are attached below.)
How this deadlock is triggered:
- A failure to copy files on the local disk triggers copyFailed() from FetcherOrderedGrouped.copyFromHost(), which is a synchronized method on the ShuffleScheduler instance.
- The call from step 1 eventually reaches ShuffleScheduler.close(), which tries to stop the Referee thread by calling referee.interrupt() and referee.join().
- Meanwhile, the Referee thread is waiting in its run() method for the ShuffleScheduler instance lock, which is held by the thread from step 1. So referee.join() never returns, and a deadlock occurs.
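The three steps above can be sketched as a minimal, self-contained Java program. The class and method names below (SchedulerSketch, close(joinMillis)) are hypothetical stand-ins, not Tez code, and close() here uses a join timeout only so the demo terminates; the real ShuffleScheduler.close() calls referee.join() with no timeout, which is exactly why it hangs forever:

```java
// Minimal sketch of the lock-ordering bug (hypothetical class; not Tez code).
// close() is synchronized on the scheduler and join()s the referee thread,
// while the referee's run() must enter the same monitor to make progress.
class SchedulerSketch {
    private final Thread referee = new Thread(this::refereeLoop, "Referee");

    // Stand-in for Referee.run(): it needs the scheduler's monitor,
    // like the real referee synchronizing on the ShuffleScheduler instance.
    private void refereeLoop() {
        synchronized (this) {
            // penalty-box bookkeeping would happen here
        }
    }

    // Stand-in for ShuffleScheduler.close(); synchronized, like copyFailed().
    // join(joinMillis) is used only so this demo terminates; the real code's
    // plain join() blocks forever.
    synchronized boolean close(long joinMillis) throws InterruptedException {
        referee.start();
        // Wait until the referee is parked on this object's monitor.
        while (referee.getState() != Thread.State.BLOCKED) {
            Thread.onSpinWait();
        }
        referee.interrupt();      // no effect: a BLOCKED thread ignores interrupt
        referee.join(joinMillis); // real code: referee.join() -> hangs forever
        return referee.isAlive(); // true => the referee never acquired the lock
    }
}

public class DeadlockDemo {
    public static void main(String[] args) throws Exception {
        boolean stuck = new SchedulerSketch().close(200);
        System.out.println(stuck ? "deadlocked (join timed out)" : "clean exit");
    }
}
```

Note that referee.interrupt() cannot help here: interruption only wakes threads that are WAITING or TIMED_WAITING (e.g. in sleep() or wait()); a thread BLOCKED on monitor entry merely has its interrupt flag set and stays blocked, which matches the BLOCKED state visible in the attached jstack.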
Attachments
Issue Links
- relates to TEZ-4334 Fix deadlock in ShuffleScheduler between ShuffleScheduler.close() and the ShufflePenaltyReferee thread (Resolved)