Hadoop HDFS / HDFS-101

DFS write pipeline: DFSClient sometimes does not detect second datanode failure

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.20.1, 0.20-append
    • Fix Version/s: 0.20.2, 0.20-append, 0.21.0
    • Component/s: datanode
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      When the first datanode's write to the second datanode fails or times out, DFSClient ends up marking the first datanode as the bad one and removing it from the pipeline. A similar problem exists on the DataNode as well and was fixed in HADOOP-3339. From HADOOP-3339:

      "The main issue is that BlockReceiver thread (and DataStreamer in the case of DFSClient) interrupt() the 'responder' thread. But interrupting is a pretty coarse control. We don't know what state the responder is in and interrupting has different effects depending on responder state. To fix this properly we need to redesign how we handle these interactions."

      When the first datanode closes its socket from DFSClient, DFSClient should properly read all the data left in the socket. Also, the DataNode's closing of the socket should not result in a TCP reset; otherwise I think DFSClient will not be able to read from the socket.
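      As a sketch of the close-side behavior described above, assuming plain java.net sockets (the GracefulClose helper is hypothetical, not a Hadoop API): the closer signals end-of-stream with shutdownOutput() and drains whatever the peer has already sent before calling close(). Closing a socket while unread bytes sit in its receive buffer is what typically causes the kernel to send a TCP reset instead of a normal FIN, which is exactly the reset warned about here:

          import java.io.IOException;
          import java.io.InputStream;
          import java.net.Socket;

          // Hypothetical helper, not a Hadoop API: close a socket so the peer
          // sees a normal end-of-stream (FIN) rather than a TCP reset (RST).
          final class GracefulClose {
              static void closeGracefully(Socket socket) throws IOException {
                  socket.shutdownOutput();        // FIN: "no more data from us"
                  InputStream in = socket.getInputStream();
                  byte[] scratch = new byte[4096];
                  // Drain anything the peer already sent; unread bytes left in
                  // the receive buffer at close() are what trigger the reset.
                  while (in.read(scratch) != -1) {
                      // discard drained bytes
                  }
                  socket.close();                 // both directions done: clean close
              }
          }

      The mirror-image obligation on the DFSClient side is to keep reading until end-of-stream, so that the datanode's final ack (including any status identifying which downstream node actually failed) is not lost.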

      Attachments

      1. hdfs-101-branch-0.20-append-cdh3.txt
        11 kB
        Todd Lipcon
      2. HDFS-101_20-append.patch
        12 kB
        Nicolas Spiegelberg
      3. pipelineHeartbeat_yahoo.patch
        4 kB
        Hairong Kuang
      4. pipelineHeartbeat.patch
        4 kB
        Hairong Kuang
      5. detectDownDN3-0.20-yahoo.patch
        10 kB
        Hairong Kuang
      6. detectDownDN3.patch
        8 kB
        Hairong Kuang
      7. detectDownDN3-0.20.patch
        9 kB
        Hairong Kuang
      8. detectDownDN2.patch
        7 kB
        Hairong Kuang
      9. detectDownDN1-0.20.patch
        9 kB
        Hairong Kuang
      10. hdfs-101.tar.gz
        3 kB
        Todd Lipcon
      11. detectDownDN-0.20.patch
        8 kB
        Hairong Kuang

          Activity

          Raghu Angadi created issue -
          Raghu Angadi made changes -
          Description: edited (old and new text identical to the description above)
          Component/s dfs [ 12310710 ]
          Raghu Angadi made changes -
          Link This issue relates to HADOOP-3339 [ HADOOP-3339 ]
          dhruba borthakur made changes -
          Link This issue blocks HADOOP-4278 [ HADOOP-4278 ]
          Owen O'Malley made changes -
          Project Hadoop Common [ 12310240 ] HDFS [ 12310942 ]
          Key HADOOP-3416 HDFS-101
          Affects Version/s 0.16.0 [ 12312740 ]
          Component/s dfs [ 12310710 ]
          Kan Zhang made changes -
          Link This issue relates to HDFS-564 [ HDFS-564 ]
          Robert Chansler made changes -
          Fix Version/s 0.21.0 [ 12314046 ]
          Priority Major [ 3 ] Blocker [ 1 ]
          Description: edited (old and new text identical to the description above)
          Hairong Kuang made changes -
          Assignee Hairong Kuang [ hairong ]
          Hairong Kuang made changes -
          Link This issue is blocked by HDFS-793 [ HDFS-793 ]
          Hairong Kuang made changes -
          Link This issue is duplicated by HDFS-795 [ HDFS-795 ]
          Todd Lipcon made changes -
          Affects Version/s 0.20.1 [ 12314048 ]
          Hairong Kuang made changes -
          Attachment detectDownDN.patch [ 12428262 ]
          Hairong Kuang made changes -
          Attachment detectDownDN-0.20.patch [ 12428361 ]
          Hairong Kuang made changes -
          Attachment detectDownDN1.patch [ 12428366 ]
          Hairong Kuang made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Todd Lipcon made changes -
          Attachment hdfs-101.tar.gz [ 12428370 ]
          Hairong Kuang made changes -
          Attachment detectDownDN1-0.20.patch [ 12428383 ]
          Hairong Kuang made changes -
          Attachment detectDownDN2.patch [ 12428384 ]
          Hairong Kuang made changes -
          Attachment detectDownDN.patch [ 12428262 ]
          Hairong Kuang made changes -
          Attachment detectDownDN1.patch [ 12428366 ]
          Hairong Kuang made changes -
          Attachment detectDownDN3-0.20.patch [ 12428498 ]
          Attachment detectDownDN3.patch [ 12428499 ]
          Hairong Kuang made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Fix Version/s 0.20.2 [ 12314204 ]
          Fix Version/s 0.22.0 [ 12314241 ]
          Resolution Fixed [ 1 ]
          Tsz Wo Nicholas Sze made changes -
          Link This issue incorporates HDFS-700 [ HDFS-700 ]
          Suresh Srinivas made changes -
          Attachment detectDownDN4-0.20.patch [ 12436670 ]
          Hairong Kuang made changes -
          Attachment detectDownDN4-0.20.patch [ 12436670 ]
          Hairong Kuang made changes -
          Attachment detectDownDN3-0.20-yahoo.patch [ 12437819 ]
          Hairong Kuang made changes -
          Attachment pipelineHeartbeat.patch [ 12439379 ]
          Hairong Kuang made changes -
          Attachment pipelineHeartbeat_yahoo.patch [ 12439387 ]
          Tom White made changes -
          Fix Version/s 0.22.0 [ 12314241 ]
          Nicolas Spiegelberg made changes -
          Affects Version/s 0.20-append [ 12315103 ]
          Nicolas Spiegelberg made changes -
          Link This issue blocks HDFS-142 [ HDFS-142 ]
          Nicolas Spiegelberg made changes -
          Attachment HDFS-101_20-append.patch [ 12446611 ]
          Todd Lipcon made changes -
          Attachment hdfs-101-branch-0.20-append-cdh3.txt [ 12447257 ]
          dhruba borthakur made changes -
          Fix Version/s 0.20-append [ 12315103 ]
          Tsz Wo Nicholas Sze made changes -
          Component/s data-node [ 12312927 ]
          Hairong Kuang made changes -
          Link This issue relates to HDFS-1346 [ HDFS-1346 ]
          Tom White made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Tsz Wo Nicholas Sze made changes -
          Link This issue is related to HDFS-1595 [ HDFS-1595 ]
          Gavin made changes -
          Link This issue blocks HDFS-142 [ HDFS-142 ]
          Gavin made changes -
          Link This issue is depended upon by HDFS-142 [ HDFS-142 ]

            People

            • Assignee: Hairong Kuang
            • Reporter: Raghu Angadi
            • Votes: 0
            • Watchers: 12
