[HDFS-4851] Deadlock in pipeline recovery - ASF JIRA

Here's a deadlock scenario that cropped up during pipeline recovery, debugged through jstacks. Todd tipped me off to this one.

Pipeline fails, client initiates recovery. We have the old leftover DataXceiver, and a new one doing recovery.
New DataXceiver does recoverRbw, grabbing the FsDatasetImpl lock.
Old DataXceiver is in BlockReceiver#computePartialChunkCrc, calls FsDatasetImpl#getTmpInputStreams and blocks on the FsDatasetImpl lock.
New DataXceiver calls ReplicaInPipeline#stopWriter, interrupting the old DataXceiver and then joining on it.
Boom, deadlock. New DX holds the FsDatasetImpl lock and is joining on the old DX, which is in turn waiting on the FsDatasetImpl lock.
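
For anyone who wants to see the shape of this outside a datanode, here is a minimal standalone sketch of the same pattern. It is not Hadoop code: `PipelineDeadlockSketch`, `datasetLock` and `oldWriterStuckWhileLockHeld` are made-up names, the monitor merely stands in for the FsDatasetImpl lock, and the join is given a timeout (an assumption made only so the demo terminates; the scenario above joins without one). The point it illustrates is that a thread blocked entering a `synchronized` block does not respond to `interrupt()`, so interrupt-then-join while holding that monitor cannot make progress.

```java
import java.util.concurrent.CountDownLatch;

public class PipelineDeadlockSketch {
    // Hypothetical stand-in for the FsDatasetImpl monitor.
    private static final Object datasetLock = new Object();

    /**
     * Returns true if the "old writer" thread is still alive after being
     * interrupted and joined (with a timeout) by a thread holding the lock.
     */
    public static boolean oldWriterStuckWhileLockHeld(long joinMillis) {
        try {
            CountDownLatch lockHeld = new CountDownLatch(1);

            // Plays the old DataXceiver: needs the dataset lock to continue.
            Thread oldWriter = new Thread(() -> {
                try {
                    lockHeld.await(); // wait until the recovery side owns the lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (datasetLock) {
                    // Stand-in for getTmpInputStreams; only reached once the lock is free.
                }
            }, "old-DataXceiver");
            oldWriter.start();

            boolean stuck;
            // Plays the new DataXceiver: recoverRbw holds the lock throughout.
            synchronized (datasetLock) {
                lockHeld.countDown();
                // Wait until the old writer is really blocked on the monitor.
                while (oldWriter.getState() != Thread.State.BLOCKED) {
                    Thread.sleep(1);
                }
                oldWriter.interrupt();      // stand-in for stopWriter's interrupt
                oldWriter.join(joinMillis); // timed here; an untimed join would hang
                stuck = oldWriter.isAlive();
            }
            // Lock released: the old writer can now acquire it and exit.
            oldWriter.join();
            return stuck;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("old writer stuck while lock held: "
                + oldWriterStuckWhileLockHeld(200));
    }
}
```

The old writer only finishes after the recovery side leaves its `synchronized` block, which is the step that never happens when the join has no timeout.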

duplicates

HDFS-3655 Datanode recoverRbw could hang sometime