Hadoop HDFS / HDFS-9106

Transfer failure during pipeline recovery causes permanent write failures

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.7.2, 3.0.0-alpha1
    • Component/s: None
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      When a new node is added to a write pipeline during flush/sync, a failure of the partial block transfer to that node causes the write to fail permanently, without retrying or continuing with the nodes that remain in the pipeline.

      The transfer often fails in busy clusters due to timeouts. There is no per-packet ACK between the client and the datanode, or between the source and target datanodes, during the transfer. If the total transfer time exceeds the configured timeout plus 10 seconds (2 * 5 seconds of slack), the transfer is considered failed. Naturally, the failure rate is higher with larger block sizes.
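      To make the timing concrete, the sketch below works through the deadline described above (per-packet/socket timeout plus 2 * 5 seconds of slack) against the time needed to ship a large partial block on a busy node. The constant names, the 60-second timeout, and the throughput figure are illustrative assumptions, not the actual client configuration.

      {code:java}
      // Illustrative arithmetic only; the timeout and throughput values below
      // are assumptions, not DFSClient defaults.
      public class TransferDeadlineEstimate {
        public static void main(String[] args) {
          long socketTimeoutMs = 60_000L;            // assumed per-packet/socket timeout
          long slackMs = 2 * 5_000L;                 // the 2 * 5 second slack noted above
          long deadlineMs = socketTimeoutMs + slackMs;

          long partialBlockBytes = 200L * 1024 * 1024;    // e.g. 200 MB already written
          long throughputBytesPerSec = 2L * 1024 * 1024;  // a busy datanode: ~2 MB/s

          long transferMs = partialBlockBytes / throughputBytesPerSec * 1000;

          System.out.printf("deadline=%ds, estimated transfer=%ds -> %s%n",
              deadlineMs / 1000, transferMs / 1000,
              transferMs > deadlineMs ? "transfer declared failed" : "within the deadline");
        }
      }
      {code}

      With these example numbers the transfer needs roughly 100 seconds against a 70-second deadline, which is why bigger blocks fail more often.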

      I propose the following changes:

      • The transfer timeout needs to be different from the per-packet timeout.
      • The transfer should be retried if it fails (see the sketch after this list).
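
      A minimal sketch of what the second bullet could look like is below. The class, the Transfer interface, the retry budget, and the transfer-specific timeout are hypothetical stand-ins for illustration; this is not the attached patch or the actual DataStreamer code.

      {code:java}
      import java.io.IOException;

      // Hypothetical sketch of the proposed behavior: retry the partial-block
      // transfer under its own timeout instead of failing the write permanently.
      public class TransferRetrySketch {

        /** Stand-in for the call that copies the partial block to the new node. */
        interface Transfer {
          void run(long timeoutMs) throws IOException;
        }

        static void transferWithRetry(Transfer transfer) throws IOException {
          final int maxAttempts = 3;                  // assumed retry budget
          final long transferTimeoutMs = 5 * 60_000;  // assumed transfer timeout,
                                                      // decoupled from the per-packet timeout
          IOException lastFailure = null;

          for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
              transfer.run(transferTimeoutMs);
              return;  // success: resume the write on the recovered pipeline
            } catch (IOException e) {
              lastFailure = e;
              System.err.println("partial block transfer failed, attempt "
                  + attempt + " of " + maxAttempts + ": " + e);
            }
          }
          // The caller could also fall back to continuing with the datanodes
          // already in the pipeline instead of failing the write outright.
          throw lastFailure;
        }
      }
      {code}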

        Attachments

        1. HDFS-9106.branch-2.7.patch (4 kB, Kihwal Lee)
        2. HDFS-9106.patch (4 kB, Kihwal Lee)
        3. HDFS-9106-poc.patch (4 kB, Kihwal Lee)

              People

              • Assignee: Kihwal Lee (kihwal)
              • Reporter: Kihwal Lee (kihwal)
              • Votes: 0
              • Watchers: 15

                Dates

                • Created:
                • Updated:
                • Resolved: