Hadoop HDFS / HDFS-1595

DFSClient may incorrectly detect datanode failure

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: datanode, hdfs-client
    • Labels: None

      Description

      Suppose a source datanode S is writing to a destination datanode D in a write pipeline. We have an implicit assumption that if S catches an exception while writing to D, then D is faulty and S is fine. As a result, DFSClient takes D out of the pipeline, reconstructs the write pipeline with the remaining datanodes, and then continues writing.
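
      The assumption plays out roughly as in the following sketch (class and method names here are hypothetical, not the actual DFSClient/DataNode code): when the thread forwarding packets downstream catches an IOException, it blames the downstream node, regardless of whether the local side caused the failure.

{code:java}
// Hypothetical sketch of the blame-downstream assumption; names are
// illustrative and do not match the real org.apache.hadoop.hdfs classes.
class PipelineHop {
    private final String downstreamAddr;          // address of the next datanode
    private final java.io.OutputStream downstreamOut;

    PipelineHop(String downstreamAddr, java.io.OutputStream downstreamOut) {
        this.downstreamAddr = downstreamAddr;
        this.downstreamOut = downstreamOut;
    }

    /** Forward a packet; on any I/O failure, report the downstream node as bad. */
    String forwardPacket(byte[] packet) {
        try {
            downstreamOut.write(packet);
            downstreamOut.flush();
            return null;                           // no error to report
        } catch (java.io.IOException e) {
            // Implicit assumption: the exception means the *downstream* node is
            // faulty, even if the local NIC (the case described here) caused it.
            return downstreamAddr;                 // reported as the bad node
        }
    }
}
{code}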

      However, we found a case where the faulty machine F is in fact S, not D. In the case we found, F has a faulty network interface (or a faulty switch port) that works fine when transferring a small amount of data, say 1MB, but often fails when transferring a large amount of data, say 100MB.

      It is even worse if F is the first datanode in the pipeline. Consider the following (a toy simulation of this loop appears after the list):

      1. DFSClient creates a pipeline with three datanodes. The first datanode is F.
      2. F catches an IOException when writing to the second datanode, so F reports that the second datanode has an error.
      3. DFSClient removes the second datanode from the pipeline and continues writing with the remaining datanode(s).
      4. The pipeline now has two datanodes, and steps (2) and (3) repeat.
      5. Now, only F remains in the pipeline. DFSClient continues writing with one replica in F.
      6. The write succeeds and DFSClient is able to close the file successfully.
      7. The block is under replicated. The NameNode schedules replication from F to some other datanode D.
      8. The replication fails for the same reason. D reports to the NameNode that the replica in F is corrupted.
      9. The NameNode marks the replica in F as corrupted.
      10. The block is corrupted since no replica is available.
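
      For illustration, the loop in steps 2-5 can be reduced to the following toy simulation (not HDFS code; it simply assumes F's interface makes every large transfer to a downstream node fail, so F always blames its current downstream peer):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Toy simulation of steps 1-6: F blames and removes each downstream datanode
// in turn until it is the only node left in the pipeline.
public class PipelineShrinkDemo {
    public static void main(String[] args) {
        List<String> pipeline = new ArrayList<>(List.of("F", "D2", "D3"));
        while (pipeline.size() > 1) {
            // F (index 0) fails to push the data to its downstream peer and,
            // by the assumption above, reports that peer as the bad node.
            String blamed = pipeline.get(1);
            System.out.println("F reports " + blamed + " as faulty; DFSClient removes it");
            pipeline.remove(blamed);               // pipeline is rebuilt without it
        }
        // Only the genuinely faulty node remains; the write then "succeeds"
        // with a single replica stored on F.
        System.out.println("Remaining pipeline: " + pipeline);
    }
}
{code}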

      We were able to manually divide the replicas into small files and copy them out from F without fixing the hardware. The replicas seem to be uncorrupted. This is a data availability problem.
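
      The manual recovery amounted to copying each replica off F in pieces small enough that the faulty interface never had to move a large amount of data in one transfer. A rough sketch of that kind of chunked copy is below (illustrative only; the paths and the 1MB piece size are assumptions, not the exact procedure used):

{code:java}
import java.io.*;

// Illustrative chunked copy: split a replica file into ~1MB pieces so that each
// transfer off the faulty host stays below the size at which its NIC fails.
public class ChunkedCopy {
    public static void main(String[] args) throws IOException {
        File replica = new File(args[0]);          // e.g. a blk_* file on F
        File outDir = new File(args[1]);           // destination for the pieces
        outDir.mkdirs();
        byte[] buf = new byte[1 << 20];            // 1MB pieces (assumed safe size)
        try (InputStream in = new FileInputStream(replica)) {
            int n, part = 0;
            while ((n = in.read(buf)) > 0) {
                File piece = new File(outDir, replica.getName() + ".part" + part++);
                try (OutputStream out = new FileOutputStream(piece)) {
                    out.write(buf, 0, n);          // copy one small piece at a time
                }
            }
        }
    }
}
{code}

      The pieces can then be concatenated back into the original replica on a healthy host.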

        Issue Links

          relates to HDFS-101 [ HDFS-101 ]
          relates to HDFS-1606 [ HDFS-1606 ]
          relates to HDFS-3875 [ HDFS-3875 ]

          Activity

          Tsz Wo Nicholas Sze created issue -
          Tsz Wo Nicholas Sze made changes -
          Link This issue relates to HDFS-101 [ HDFS-101 ]
          Tsz Wo Nicholas Sze made changes -
          Priority Major [ 3 ] → Critical [ 2 ]
          Description edited (wording fixes; added "This is a *data loss* scenario.")
          Todd Lipcon made changes -
          Attachment hdfs-1595-idea.txt [ 12469243 ]
          Tsz Wo Nicholas Sze made changes -
          Description edited (noted that reading works fine for any data size)
          Tsz Wo Nicholas Sze made changes -
          Description edited (reworded the conclusion as a *data availability problem* and noted that the replicas could be copied out of F in small pieces)
          Tsz Wo Nicholas Sze made changes -
          Link This issue relates to HDFS-1606 [ HDFS-1606 ]
          Owen O'Malley made changes -
          Affects Version/s 0.20.4 [ 12316035 ]
          Tsz Wo Nicholas Sze made changes -
          Link This issue relates to HDFS-3875 [ HDFS-3875 ]

            People

            • Assignee: Unassigned
            • Reporter: Tsz Wo Nicholas Sze
            • Votes: 1
            • Watchers: 12

              Dates

              • Created:
              • Updated:
