Suppose a source datanode S is writing to a destination datanode D in a write pipeline. There is an implicit assumption that if S catches an exception while writing to D, then D is faulty and S is fine. As a result, DFSClient removes D from the pipeline, reconstructs the write pipeline with the remaining datanodes, and then continues writing.
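A minimal sketch (toy code, not the HDFS implementation; the function name is hypothetical) of that fault-attribution assumption: whichever downstream node the writer blames is dropped, even when the writer's own hardware caused the failure.

```python
def recover_pipeline(pipeline, blamed):
    """Drop the blamed (downstream) datanode and keep the rest.

    This mirrors the implicit assumption: the node that *reported*
    the exception is trusted, and the node it was writing to is
    assumed faulty -- even if the reporter's own network interface
    caused the error.
    """
    assert blamed in pipeline
    return [dn for dn in pipeline if dn != blamed]

# F (the first datanode) blames the second datanode after an IOException.
pipeline = recover_pipeline(["F", "DN2", "DN3"], blamed="DN2")
print(pipeline)  # ['F', 'DN3'] -- F, the actual culprit, stays in the pipeline
```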
However, we found a case where the faulty machine F is actually S, not D. In the case we encountered, F has a faulty network interface (or a faulty switch port) that works fine when transferring a small amount of data, say 1 MB, but often fails when transferring a large amount of data, say 100 MB.
It is even worse if F is the first datanode in the pipeline. Consider the following:
1. DFSClient creates a pipeline with three datanodes. The first datanode is F.
2. F catches an IOException when writing to the second datanode. F then reports that the second datanode has an error.
3. DFSClient removes the second datanode from the pipeline and continues writing with the remaining datanode(s).
4. The pipeline now has two datanodes, but steps 2 and 3 repeat.
5. Now only F remains in the pipeline. DFSClient continues writing with a single replica on F.
6. The write succeeds and DFSClient is able to close the file successfully.
7. The block is under-replicated. The NameNode schedules replication from F to some other datanode D.
8. The replication fails for the same reason. D reports to the NameNode that the replica on F is corrupt.
9. The NameNode marks the replica on F as corrupt.
10. The block is now corrupt since no valid replica is available.
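The whole cascade can be played out in a toy simulation (hypothetical names and thresholds, not HDFS code): F's flaky interface fails on every large transfer, so each downstream node gets blamed in turn, the client finishes with a single replica on F, and the later re-replication attempt condemns that replica.

```python
LARGE = 100 * 1024 * 1024  # assumed size at which F's interface fails

def transfer_from(node, nbytes):
    """Model of the flaky hardware: transfers out of F fail when large."""
    if node == "F" and nbytes >= LARGE:
        raise IOError("transfer failure through F's network interface")
    return True

def client_write(pipeline, nbytes):
    """Model of DFSClient recovery: blame and drop the downstream node."""
    while len(pipeline) > 1:
        try:
            transfer_from(pipeline[0], nbytes)
            return pipeline
        except IOError:
            pipeline = [pipeline[0]] + pipeline[2:]  # drop the "bad" second node
    return pipeline  # only F left; the local write succeeds

survivors = client_write(["F", "DN2", "DN3"], LARGE)
print(survivors)  # ['F'] -- the file closes with one replica, on F

# The NameNode schedules re-replication from F; it fails the same way,
# and the only replica is marked corrupt even though the on-disk data is fine.
try:
    transfer_from(survivors[0], LARGE)
    replica_state = "healthy"
except IOError:
    replica_state = "corrupt"
print(replica_state)  # corrupt
```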
We were able to manually split the replicas into small files and copy them off F without fixing the hardware; the replicas appear uncorrupted. This is a data availability problem.
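The workaround works because F's interface survives small transfers. A hedged sketch of the idea (paths and the 1 MB chunk size are illustrative, not from the report): move the replica off F one small piece at a time instead of as a single large copy.

```python
import os

CHUNK = 1 * 1024 * 1024  # 1 MB: small enough for F's flaky interface

def copy_in_small_pieces(src_path, dst_path, chunk=CHUNK):
    """Copy src to dst one small chunk at a time.

    In the real workaround each small piece crossed the network
    separately; here both paths are local for illustration.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            piece = src.read(chunk)
            if not piece:
                break
            dst.write(piece)
    return os.path.getsize(dst_path)
```

Reassembling the pieces on the destination and verifying the block's checksum confirms that the data on F's disk was never actually corrupt.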