[HDFS-4851] Deadlock in pipeline recovery - ASF JIRA

Here's a deadlock scenario that cropped up during pipeline recovery, debugged through jstacks. Todd tipped me off to this one.

Pipeline fails, client initiates recovery. We have the old leftover DataXceiver, and a new one doing recovery.
New DataXceiver calls recoverRbw, grabbing the FsDatasetImpl lock.
Old DataXceiver is in BlockReceiver#computePartialChunkCrc, calls FsDatasetImpl#getTmpInputStreams and blocks on the FsDatasetImpl lock.
New DataXceiver calls ReplicaInPipeline#stopWriter, which interrupts the old DataXceiver and then joins on it.
Boom, deadlock. The new DataXceiver holds the FsDatasetImpl lock and is joining on the old DataXceiver, which is in turn waiting on the FsDatasetImpl lock.
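
The sequence above can be reproduced in miniature. This is a rough sketch, not HDFS code: `datasetLock` is a hypothetical stand-in for the FsDatasetImpl lock (assumed here to be a plain Java monitor), the lambda stands in for the old DataXceiver, and the calling thread plays the new DataXceiver. The join is given a timeout only so the demo terminates; with an untimed join, as in the scenario described, neither thread would ever make progress.

```java
import java.util.concurrent.CountDownLatch;

public class PipelineRecoveryDeadlockSketch {
    // Hypothetical stand-in for the FsDatasetImpl lock (assumed to be a monitor).
    private static final Object datasetLock = new Object();

    /**
     * Returns true if the "old writer" thread is still alive after the
     * "recovery" thread interrupts it and joins on it for joinMillis
     * while still holding datasetLock.
     */
    public static boolean oldWriterStuckWhileLockHeld(long joinMillis)
            throws InterruptedException {
        CountDownLatch lockHeld = new CountDownLatch(1);

        // Plays the old DataXceiver: needs the dataset lock to proceed.
        Thread oldWriter = new Thread(() -> {
            try {
                lockHeld.await(); // wait until the recovery thread owns the lock
            } catch (InterruptedException e) {
                return;
            }
            synchronized (datasetLock) {
                // Analogous to the getTmpInputStreams call: nothing to do here,
                // the point is only that entering the monitor blocks.
            }
        }, "old-DataXceiver");
        oldWriter.start();

        boolean stuck;
        synchronized (datasetLock) { // analogous to recoverRbw taking the lock
            lockHeld.countDown();
            // Wait until the old writer is actually blocked on the monitor.
            while (oldWriter.getState() != Thread.State.BLOCKED) {
                Thread.sleep(1);
            }
            oldWriter.interrupt();      // analogous to stopWriter's interrupt
            oldWriter.join(joinMillis); // timed here; an untimed join never returns
            stuck = oldWriter.isAlive();
        }
        // The lock is released now, so the old writer can finish.
        oldWriter.join();
        return stuck;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("old writer stuck while lock held: "
                + oldWriterStuckWhileLockHeld(200));
    }
}
```

The interrupt does not help because a Java thread blocked entering a `synchronized` region does not respond to interruption; it only notices the interrupt after it acquires the monitor. So the joiner has to either release the lock before joining or avoid requiring the lock on the old writer's path.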