Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Not A Problem
-
None
-
None
-
None
-
None
Description
With 0.17 we notice a fast rate of task failures because of the same bad data nodes being reported repeatedly as badFirstLink. We never saw this in 0.16.
After running less than 20,000 map tasks, more than 2,500 of them reported a single certain datanode as badFirstLink, with typical exception of the form:
08/09/09 14:41:14 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 189000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.yyy.zzz.ttt:38788 remote=/xxx.yyy.zzz.ttt:50010]
08/09/09 14:41:14 INFO dfs.DFSClient: Abandoning block blk_-3650954811734254315
08/09/09 14:41:14 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
08/09/09 14:44:29 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 189000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.yyy.zzz.ttt:39014 remote=/xxx.yyy.zzz.ttt:50010]
08/09/09 14:44:29 INFO dfs.DFSClient: Abandoning block blk_8665387817606483066
08/09/09 14:44:29 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
08/09/09 14:47:35 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
08/09/09 14:47:35 INFO dfs.DFSClient: Abandoning block blk_8475261758012143524
08/09/09 14:47:35 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
08/09/09 14:50:42 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
08/09/09 14:50:42 INFO dfs.DFSClient: Abandoning block blk_4847638219960634858
08/09/09 14:50:42 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
08/09/09 14:50:48 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
08/09/09 14:50:48 WARN dfs.DFSClient: Error Recovery for block blk_4847638219960634858 bad datanode[2]
Exception in thread "main" java.io.IOException: Could not get block locations. Aborting...
With several such bad datanodes the probability of jobs failing goes up a lot.