Hadoop HDFS / HDFS-9752

Permanent write failures may happen to slow writers during datanode rolling upgrades


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.3, 2.6.5, 3.0.0-alpha1
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      When datanodes are being upgraded, a restarting datanode sends an out-of-band ack upstream and the client performs a pipeline recovery. The client may go through this several times as more nodes are upgraded. Normally this causes no problem, but if the client holds the stream open without writing any data during this time, a permanent write failure can occur.

      The cause is a limit of 5 recovery attempts for the same packet, which is tracked by the "last acked sequence number". Because the empty heartbeat packets of an idle output stream do not increment the sequence number, the write fails once the stream has seen 5 pipeline breakages caused by datanode upgrades.

      This limit was added to keep the client from retrying until it runs out of nodes in the cluster when the failure is due to corruption or another unrecoverable condition. Pipeline breakages caused by a datanode upgrade restart should be excluded from the count.
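
      The counting behavior described above can be modeled in a few lines. The following is a minimal sketch under stated assumptions, not the actual HDFS client code: the class `RecoveryLimitSketch`, its fields, and the `excludeRestarts` flag are hypothetical names introduced only for illustration. It shows why an idle writer runs into the limit when every breakage is counted, and how skipping restart-induced breakages avoids that.

```java
import java.io.IOException;

/** Hypothetical model of the per-packet recovery limit; not the real client code. */
public class RecoveryLimitSketch {
    // Limit quoted in the issue description.
    static final int MAX_RECOVERIES_PER_PACKET = 5;

    private long lastAckedSeqno = -1;              // advances only when a data packet is acked
    private long lastAckedSeqnoBeforeFailure = -1; // value of lastAckedSeqno at the previous failure
    private int recoveryCount = 0;
    private final boolean excludeRestarts;         // assumed switch for the proposed behavior

    RecoveryLimitSketch(boolean excludeRestarts) {
        this.excludeRestarts = excludeRestarts;
    }

    /** A data packet was acknowledged. Heartbeats never call this. */
    void ackDataPacket() {
        lastAckedSeqno++;
    }

    /** Called once per pipeline breakage, before recovery is attempted. */
    void onPipelineFailure(boolean causedByDatanodeRestart) throws IOException {
        if (excludeRestarts && causedByDatanodeRestart) {
            return; // proposed behavior: upgrade restarts do not count
        }
        if (lastAckedSeqnoBeforeFailure != lastAckedSeqno) {
            // Progress was made since the last failure, so start counting afresh.
            lastAckedSeqnoBeforeFailure = lastAckedSeqno;
            recoveryCount = 1;
        } else if (++recoveryCount > MAX_RECOVERIES_PER_PACKET) {
            throw new IOException("Too many pipeline recoveries for the same packet");
        }
    }

    /** Counts how many restart-induced breakages an idle writer gets through. */
    static int restartsSurvivedWhileIdle(boolean excludeRestarts, int restarts) {
        RecoveryLimitSketch stream = new RecoveryLimitSketch(excludeRestarts);
        stream.ackDataPacket(); // one packet written, then the stream goes idle
        int survived = 0;
        try {
            for (int i = 0; i < restarts; i++) {
                stream.onPipelineFailure(true);
                survived++;
            }
        } catch (IOException e) {
            // Permanent write failure: stop counting.
        }
        return survived;
    }

    public static void main(String[] args) {
        System.out.println(restartsSurvivedWhileIdle(false, 10)); // prints 5
        System.out.println(restartsSurvivedWhileIdle(true, 10));  // prints 10
    }
}
```

      In this model a writer that keeps sending data never approaches the limit, because each acked packet resets the count. Only the idle writer is affected, which matches the "slow writers" in the issue title.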

      Attachments

        1. HDFS-9752.01.patch (7 kB, Walter Su)
        2. HDFS-9752.02.patch (7 kB, Walter Su)
        3. HDFS-9752.03.patch (12 kB, Walter Su)
        4. HDFS-9752-branch-2.6.03.patch (11 kB, Walter Su)
        5. HDFS-9752-branch-2.7.03.patch (11 kB, Walter Su)
        6. HdfsWriter.java (2 kB, Xiaobing Zhou)

          People

            Assignee: Walter Su (walter.k.su)
            Reporter: Kihwal Lee (kihwal)
            Votes: 0
            Watchers: 15
