This problem also exists in branch-1.0.
This can happen in the following scenario (a minimal client-side sketch follows the list):
1. Consider the pipeline [ DN1 -> DN2 -> DN3 ].
2. Create one file and get the output stream.
3. Write some bytes using that stream and call sync.
4. Keep the stream open.
5. Now take the DN2 machine off the network (unplug the cable, power it off, or bring its ethernet interface down).
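For reference, a minimal client-side sketch of steps 2-4 (the path and payload here are placeholders; sync() is the branch-1 API, equivalent to hflush() on trunk):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenStreamRepro {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Step 2: create one file and get the stream (replication 3 => pipeline DN1 -> DN2 -> DN3)
    FSDataOutputStream out = fs.create(new Path("/test/openStreamRepro"), (short) 3);

    // Step 3: write some bytes using that stream and call sync
    out.write("some bytes".getBytes());
    out.sync();

    // Step 4: keep the stream open; now take DN2 off the network and wait
    Thread.sleep(Long.MAX_VALUE);
  }
}
{code}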
The explanation is as follows:
Consider the case where the caller is not writing any data, the DataStreamer and ResponseProcessor threads are running, and the DN2 machine's ethernet is down:
1. At time t1, the ResponseProcessor starts reading the ack from DN1 [timeout is 69 secs].
2. But on DN1, the PacketResponder has not yet started reading the ack; it waits on the ackQueue until a packet arrives.
3. Only after t1 + 34.5 secs does the DataStreamer stream a HEART_BEAT packet to DN1 [if there is no data packet, the DataStreamer sends a HEART_BEAT packet after waiting for half of the timeout value].
4. Only then does the DataXceiver receive the packet and put it into the ackQueue on the DN side.
5. At time t2, once the packet is enqueued in the ackQueue, the PacketResponder starts reading the ack from DN2 [timeout is 66 secs].
6. As the DN2 machine's ethernet is down, the PacketResponder in DN1 gets no reply.
7. But the PacketResponder will time out only after t2 + 66 secs.
8. Hence the ResponseProcessor gets a SocketTimeoutException earlier than the PacketResponder.
Here t2 - t1 >= 34.5 secs [the reported scenario can happen whenever t2 - t1 exceeds 3 secs, the difference between the two timeouts]; see the worked timeline below.
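To make the timeline concrete with the values above: the ResponseProcessor starts its 69-sec read at t1, so the client hits SocketTimeoutException at t1 + 69. The PacketResponder on DN1 starts its 66-sec read only at t2 >= t1 + 34.5, so the earliest it can time out is t1 + 34.5 + 66 = t1 + 100.5. The client therefore gives up on DN1 at least ~31.5 secs before DN1 can give up on DN2. [The 69/66 figures are consistent with a base read timeout of 60 secs plus a 3-sec extension per pipeline node downstream, i.e. 60 + 3*3 for the client and 60 + 3*2 for DN1, though the exact constants are not essential to the argument.]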
So the DFSClient gets the SocketTimeoutException before DN1 does.
Hence the DFSClient marks DN1 [which is up] as the bad datanode and does not detect DN2 [which is actually down] as the bad datanode.
In the DataNode, the PacketResponder starts reading the ack only after it receives a packet [either a data or a heartbeat packet].
But in the DFSClient, the ResponseProcessor starts reading the ack even before any packet is sent; a scaled-down simulation of this race is sketched below.
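A scaled-down, self-contained simulation of this asymmetry (1 ms below stands for 1 sec in the report; the class models the two threads described above, it is not actual HDFS code):

{code:java}
public class AckTimeoutRace {
  static final long CLIENT_READ_TIMEOUT = 69;  // DFSClient ResponseProcessor -> DN1
  static final long DN1_READ_TIMEOUT    = 66;  // DN1 PacketResponder -> DN2
  static final long HEARTBEAT_DELAY     = 34;  // ~half of the client timeout (34.5)

  public static void main(String[] args) throws Exception {
    final long t1 = System.currentTimeMillis();

    // ResponseProcessor: starts waiting for an ack from DN1 immediately, at t1.
    Thread responseProcessor = new Thread(() -> {
      sleep(CLIENT_READ_TIMEOUT);
      log(t1, "DFSClient gets SocketTimeoutException and marks DN1 (which is up) as bad");
    });

    // PacketResponder on DN1: waits on the ackQueue until the heartbeat packet
    // arrives at t2 = t1 + HEARTBEAT_DELAY, and only then starts its downstream read.
    Thread packetResponder = new Thread(() -> {
      sleep(HEARTBEAT_DELAY);
      log(t1, "DN1 PacketResponder starts reading the ack from DN2");
      sleep(DN1_READ_TIMEOUT);
      log(t1, "DN1 would only now detect DN2 (which is down) as bad");
    });

    responseProcessor.start();
    packetResponder.start();
    responseProcessor.join();
    packetResponder.join();
  }

  static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }

  static void log(long t1, String msg) {
    System.out.println("t1+" + (System.currentTimeMillis() - t1) + "ms: " + msg);
  }
}
{code}

Run as-is, it prints the client timing out at ~t1+69 while DN1 cannot detect the dead DN2 until ~t1+100, which is exactly the window in which the wrong node gets marked bad.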