Hadoop HDFS / HDFS-795

DFS Write pipeline does not detect defective datanode correctly in some cases (HADOOP-3339)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.20.2
    • Component/s: hdfs-client
    • Labels:
      None

      Description

      The HDFS write pipeline does not select the correct datanode to remove in some error cases. One example: say DN2 is the second datanode and a write to it times out because it is in a bad state. The pipeline actually removes the first datanode instead. If such a datanode happens to be the last one in the pipeline, the write is aborted completely with a hard error.

      Essentially, the error occurs when writing to a downstream datanode fails, rather than when reading fails. This bug was actually fixed in 0.18 (HADOOP-3339), but HADOOP-1700 essentially reverted it. I am not sure why.

      It is absolutely essential for HDFS to handle failures on a subset of the datanodes in a pipeline. At the very least, we should not have known bugs that lead to hard failures.

      I will attach a patch for a hack that illustrates this problem. I am still thinking about what an automated test would look like for this one.

      My preferred target for this fix is 0.20.1.
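
      The failure mode described above can be illustrated with a small, self-contained sketch. This is not the actual DFSClient or DataNode code; the class, enum, and method names are hypothetical, and the rule that a non-write error is blamed on the reporting node is a simplifying assumption. It only shows the intended attribution: when a datanode reports that its write to the next datanode failed, the downstream node should be the one dropped from the pipeline, not the node that reported the error.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the Hadoop source tree.
public class PipelineRecoverySketch {

    // Kind of error observed by the reporting node (assumed categories).
    enum FailureKind { READ, WRITE_TO_DOWNSTREAM }

    // reporterIndex is the pipeline position of the node that saw the error (0 = first).
    // A failed write to the downstream node points at the next node in the pipeline.
    static int suspectIndex(int reporterIndex, FailureKind kind, int pipelineSize) {
        if (kind == FailureKind.WRITE_TO_DOWNSTREAM && reporterIndex + 1 < pipelineSize) {
            return reporterIndex + 1;
        }
        // Simplifying assumption: any other error is blamed on the reporter itself.
        return reporterIndex;
    }

    // Returns a copy of the pipeline with the suspected node removed.
    static List<String> removeSuspect(List<String> pipeline, int reporterIndex, FailureKind kind) {
        List<String> remaining = new ArrayList<>(pipeline);
        int suspect = suspectIndex(reporterIndex, kind, pipeline.size());
        remaining.remove(suspect); // removes by index
        return remaining;
    }

    public static void main(String[] args) {
        List<String> pipeline = List.of("DN1", "DN2", "DN3");
        // DN1 (index 0) times out writing to DN2, so DN2 should be dropped.
        System.out.println(removeSuspect(pipeline, 0, FailureKind.WRITE_TO_DOWNSTREAM));
        // prints [DN1, DN3]
    }
}
```

      In the scenario from the description, the buggy behaviour corresponds to removing index 0 (DN1) even though the timeout was on the write to DN2.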

          Activity

          Raghu Angadi created issue -
          Raghu Angadi made changes -
          Field Original Value New Value
          Priority Major [ 3 ] Blocker [ 1 ]
          Description [ original and new values both match the Description text above ]
          Raghu Angadi made changes -
          Attachment toreproduce-5796.patch [ 12407650 ]
          Raghu Angadi made changes -
          Priority Blocker [ 1 ] Major [ 3 ]
          Owen O'Malley made changes -
          Fix Version/s 0.20.2 [ 12314203 ]
          Fix Version/s 0.20.1 [ 12313866 ]
          Todd Lipcon made changes -
          Project Hadoop Common [ 12310240 ] Hadoop HDFS [ 12310942 ]
          Key HADOOP-5796 HDFS-795
          Affects Version/s 0.19.0 [ 12313211 ]
          Fix Version/s 0.20.2 [ 12314204 ]
          Fix Version/s 0.20.2 [ 12314203 ]
          Todd Lipcon made changes -
          Affects Version/s 0.20.1 [ 12314048 ]
          Priority Major [ 3 ] Critical [ 2 ]
          Component/s hdfs client [ 12312928 ]
          Hairong Kuang made changes -
          Link This issue duplicates HDFS-101 [ HDFS-101 ]
          Todd Lipcon made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Duplicate [ 3 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              Raghu Angadi
            • Votes:
              0
              Watchers:
              7