[HDFS-795] DFS Write pipeline does not detect defective datanode correctly in some cases (HADOOP-3339)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.20.2
    • Component/s: hdfs-client
    • Labels:
      None

      Description

      The HDFS write pipeline does not select the correct datanode in some error cases. One example: say DN2 is the second datanode and the write to it times out because it is in a bad state; the pipeline actually removes the first datanode. If such a defective datanode happens to be the last one in the pipeline, the write is aborted completely with a hard error.

      Essentially, the error occurs when writing to a downstream datanode fails, rather than reading from it. This bug was actually fixed in 0.18 (HADOOP-3339), but HADOOP-1700 essentially reverted it. I am not sure why.

      It is absolutely essential for HDFS to handle failures on a subset of the datanodes in a pipeline. At the very least, we should not have known bugs that lead to hard failures.

      I will attach a patch for a hack that illustrates this problem. I am still thinking about what an automated test for this would look like.

      My preferred target for this fix is 0.20.1.
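
      A schematic sketch of the failure mode described above (this is not the actual DFSClient recovery code; the class name, the hard-coded bad node, and the eject-index-0 loop are assumptions made purely for illustration):

          // Schematic illustration only, not the real DFSClient/DataStreamer code.
          // It mimics the behavior described above: when a write to a downstream
          // datanode fails, recovery keeps blaming the first node instead of the bad one.
          import java.util.ArrayList;
          import java.util.Arrays;
          import java.util.List;

          public class PipelineRecoverySketch {
            public static void main(String[] args) {
              List<String> pipeline = new ArrayList<String>(Arrays.asList("DN1", "DN2", "DN3"));
              String badNode = "DN3";   // the defective datanode, last in the pipeline

              while (!pipeline.isEmpty()) {
                if (!pipeline.contains(badNode)) {
                  System.out.println("write succeeds on " + pipeline);
                  return;
                }
                // Buggy behavior reported here: the ack processing cannot tell which
                // downstream node failed, so recovery ejects pipeline[0] every time.
                System.out.println("write failed; ejecting " + pipeline.remove(0));
              }
              System.out.println("pipeline exhausted: hard failure, write aborted");
            }
          }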

        Issue Links

        • This issue duplicates HDFS-101

          Activity

          Todd Lipcon made changes -
          Status: Open [ 1 ] → Resolved [ 5 ]
          Resolution: Duplicate [ 3 ]
          Todd Lipcon added a comment -

          HDFS-101 duplicates this, and the fix is under way there.

          Todd Lipcon added a comment -

          Great, thanks Hairong. FYI this is happening in 0.20.1, and I think it should be fixed in the branch as well. Let me know if you need help testing a patch - I can reproduce it pretty reliably.

          Hairong Kuang made changes -
          Link: This issue duplicates HDFS-101 [ HDFS-101 ]
          Hairong Kuang added a comment -

          I will get this issue fixed in HDFS-101.

          Todd Lipcon made changes -
          Affects Version/s: 0.20.1 [ 12314048 ]
          Priority: Major [ 3 ] → Critical [ 2 ]
          Component/s: hdfs client [ 12312928 ]
          Todd Lipcon added a comment -

          Upgrading to critical since this is reproducible and causes complete pipeline failure for writers.

          Todd Lipcon made changes -
          Project: Hadoop Common [ 12310240 ] → Hadoop HDFS [ 12310942 ]
          Key: HADOOP-5796 → HDFS-795
          Affects Version/s: 0.19.0 [ 12313211 ]
          Fix Version/s: 0.20.2 [ 12314204 ]
          Fix Version/s: 0.20.2 [ 12314203 ]
          Todd Lipcon added a comment -

          Silly me, now I see that this patch is just to reproduce, not to fix. I will investigate a fix since I have a good petri dish in which to reproduce this issue.

          Todd Lipcon added a comment -

          Not certain if what I'm seeing is the exact same cause, but I have another reproducible case in which the write pipeline recovery decides the first node is dead every time, when in actuality it's the last node that's dead. In my case, I've set up a 3-node HDFS cluster with replication 3, each DN having one 100G volume and one 2G volume. The 2G volumes fill up, throw DiskOutOfSpaceExceptions, and the write pipeline recovers incorrectly when the node that runs out of space is the last. It first ejects pipeline[0], fails again when trying to continue the write on the dead node, ejects the second, then tries again writing only to the failed node. Of course that fails too, and the whole write is aborted.

          I'll try applying this patch (and thinking it through a bit further) and seeing if it resolves the issue.
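
          A rough sketch of the kind of per-datanode settings behind this setup (the mount points /data/big and /data/small are assumptions; the 100G/2G capacities come from the underlying volumes, not from configuration):

              import org.apache.hadoop.conf.Configuration;

              public class TwoVolumeSetupSketch {
                // Illustrative only: datanode-side settings for a node with one large
                // and one small storage directory, as described above.
                public static Configuration datanodeConf() {
                  Configuration conf = new Configuration();
                  conf.set("dfs.data.dir", "/data/big/dfs,/data/small/dfs"); // two storage dirs per DN
                  conf.setInt("dfs.replication", 3);                         // 3-node cluster, replication 3
                  return conf;
                }
              }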

          Owen O'Malley made changes -
          Fix Version/s: 0.20.2 [ 12314203 ]
          Fix Version/s: 0.20.1 [ 12313866 ]
          Raghu Angadi made changes -
          Priority: Blocker [ 1 ] → Major [ 3 ]
          Raghu Angadi made changes -
          Attachment: toreproduce-5796.patch [ 12407650 ]
          Raghu Angadi added a comment -

          The attached patch toreproduce-5796.patch helps illustrate the problem. How to reproduce:

          Create an HDFS cluster with 2 datanodes. For one of them, set "dfs.datanode.address" to "0.0.0.0:50013". Now try to write a 5MB file. You will notice that whenever the datanode on port 50013 is the last datanode in the pipeline, the write is aborted.
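
          A minimal client-side sketch of this reproduction (assuming the attached patch is applied, a 2-datanode cluster is running with one datanode bound to port 50013, and the client picks up that cluster's configuration; the path /tmp/repro-5mb is an arbitrary example):

              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.fs.FSDataOutputStream;
              import org.apache.hadoop.fs.FileSystem;
              import org.apache.hadoop.fs.Path;

              // Minimal sketch of the reproduction described above. Assumes a running
              // 2-datanode cluster where one datanode uses dfs.datanode.address
              // 0.0.0.0:50013 and the attached toreproduce-5796.patch is applied.
              public class WriteReproSketch {
                public static void main(String[] args) throws Exception {
                  Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
                  FileSystem fs = FileSystem.get(conf);        // the cluster's DistributedFileSystem

                  FSDataOutputStream out = fs.create(new Path("/tmp/repro-5mb"));
                  byte[] chunk = new byte[64 * 1024];
                  for (int written = 0; written < 5 * 1024 * 1024; written += chunk.length) {
                    out.write(chunk);                          // ~5MB total, spanning many packets
                  }
                  out.close();  // aborts when the port-50013 datanode is last in the pipeline
                }
              }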

          The hunk from the HADOOP-1700 patch that reverts the earlier fix:

          @@ -2214,10 +2218,15 @@
                         /* The receiver thread cancelled this thread. 
                          * We could also check any other status updates from the 
                          * receiver thread (e.g. if it is ok to write to replyOut). 
          +               * It is prudent to not send any more status back to the client
          +               * because this datanode has a problem. The upstream datanode
          +               * will detect a timout on heartbeats and will declare that
          +               * this datanode is bad, and rightly so.
                          */
                         LOG.info("PacketResponder " + block +  " " + numTargets +
                                  " : Thread is interrupted.");
                         running = false;
          +              continue;
                       }
                       
                       if (!didRead) {
          

          I don't think the added justification is always correct.

          Suggested fix:
          ============

          • the loop should 'continue' if the write to the local disk fails.
          • it should not if the write to the downstream mirror fails (this test case); a rough sketch of this control flow follows below.
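
          A rough sketch of the suggested control flow (this is not the actual PacketResponder code in DataNode.java; the class, the localWriteFailed flag, and the stub methods are assumptions made for illustration):

              // Rough sketch only, NOT the real PacketResponder from DataNode.java.
              // It illustrates the suggestion above: go silent (continue) only when this
              // datanode's own disk write failed, so the upstream node times out and
              // blames this node; when the downstream mirror failed, keep responding so
              // the client can identify and eject the downstream node instead.
              public class PacketResponderSketch implements Runnable {
                private volatile boolean running = true;
                private volatile boolean localWriteFailed = false; // set by the receiver on disk errors

                public void run() {
                  while (running) {
                    boolean gotAck = readAckFromDownstream();      // stub for the real ack read

                    if (Thread.interrupted()) {
                      // The receiver thread cancelled this thread.
                      if (localWriteFailed) {
                        // This datanode itself is bad: stay silent and let the upstream
                        // node time out and (rightly) declare this datanode bad.
                        running = false;
                        continue;
                      }
                      // The failure was on the downstream mirror: do NOT go silent here.
                      // Fall through so the error is reported back toward the client.
                    }

                    if (!gotAck) {
                      sendErrorAckUpstream();                      // stub for the real error ack
                      running = false;
                    }
                  }
                }

                private boolean readAckFromDownstream() { return false; } // stub
                private void sendErrorAckUpstream() { }                   // stub
              }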
          Raghu Angadi made changes -
          Priority: Major [ 3 ] → Blocker [ 1 ]
          Raghu Angadi created issue -

             People

             • Assignee: Unassigned
             • Reporter: Raghu Angadi
             • Votes: 0
             • Watchers: 7
