Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-4851

Deadlock in pipeline recovery

VotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • 2.0.4-alpha, 3.0.0-alpha1
    • None
    • datanode
    • None

    Description

      Here's a deadlock scenario that cropped up during pipeline recovery, debugged through jstacks. Todd tipped me off to this one.

      1. Pipeline fails, client initiates recovery. We have the old leftover DataXceiver, and a new one doing recovery.
      2. New DataXceiver does recoverRbw, grabbing the FsDatasetImpl lock
      3. Old DataXceiver is in BlockReceiver#computePartialChunkCrc, calls FsDatasetImpl#getTmpInputStreams and blocks on the FsDatasetImpl lock.
      4. New DataXceiver ReplicaInPipeline#stopWriter, interrupting the old DataXceiver and then joining on it.
      5. Boom, deadlock. New DX holds the FsDatasetImpl lock and is joining on the old DX, which is in turn waiting on the FsDatasetImpl lock.

      Attachments

        1. hdfs-4851-1.patch
          1 kB
          Andrew Wang

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            andrew.wang Andrew Wang
            andrew.wang Andrew Wang
            Votes:
            0 Vote for this issue
            Watchers:
            14 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment