Hadoop HDFS / HDFS-130

high rate of task failures because of bad or full datanodes

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem

    Description

      With 0.17 we see a high rate of task failures because the same bad datanodes are repeatedly reported as firstBadLink. We never saw this in 0.16.

      After running fewer than 20,000 map tasks, more than 2,500 of them reported one particular datanode as firstBadLink, with typical exceptions of the form:

      08/09/09 14:41:14 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 189000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.yyy.zzz.ttt:38788 remote=/xxx.yyy.zzz.ttt:50010]
      08/09/09 14:41:14 INFO dfs.DFSClient: Abandoning block blk_-3650954811734254315
      08/09/09 14:41:14 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
      08/09/09 14:44:29 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 189000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx.yyy.zzz.ttt:39014 remote=/xxx.yyy.zzz.ttt:50010]
      08/09/09 14:44:29 INFO dfs.DFSClient: Abandoning block blk_8665387817606483066
      08/09/09 14:44:29 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
      08/09/09 14:47:35 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
      08/09/09 14:47:35 INFO dfs.DFSClient: Abandoning block blk_8475261758012143524
      08/09/09 14:47:35 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
      08/09/09 14:50:42 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010
      08/09/09 14:50:42 INFO dfs.DFSClient: Abandoning block blk_4847638219960634858
      08/09/09 14:50:42 INFO dfs.DFSClient: Waiting to find target node: xxx.yyy.zzz.ttt:50010
      08/09/09 14:50:48 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
      08/09/09 14:50:48 WARN dfs.DFSClient: Error Recovery for block blk_4847638219960634858 bad datanode[2]
      Exception in thread "main" java.io.IOException: Could not get block locations. Aborting...
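
To see which datanodes are reported most often, the client logs can be tallied. The following is a minimal sketch, assuming the log lines keep the "Bad connect ack with firstBadLink host:port" shape shown above; the function name `count_first_bad_links` and the `sample` list are hypothetical and not part of Hadoop.

```python
import re
from collections import Counter

# Matches the datanode named in a "Bad connect ack" line, for example:
# "... Bad connect ack with firstBadLink ip.bad.data.node:50010"
FIRST_BAD_LINK = re.compile(r"Bad connect ack with firstBadLink (\S+)")


def count_first_bad_links(lines):
    """Return a Counter mapping datanode address to its firstBadLink report count."""
    counts = Counter()
    for line in lines:
        match = FIRST_BAD_LINK.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample built from the log excerpt in this report.
sample = [
    "08/09/09 14:47:35 INFO dfs.DFSClient: Exception in createBlockOutputStream "
    "java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010",
    "08/09/09 14:47:35 INFO dfs.DFSClient: Abandoning block blk_8475261758012143524",
    "08/09/09 14:50:42 INFO dfs.DFSClient: Exception in createBlockOutputStream "
    "java.io.IOException: Bad connect ack with firstBadLink ip.bad.data.node:50010",
]

# Print datanodes ordered by how often they were reported.
for node, n in count_first_bad_links(sample).most_common():
    print(node, n)
```

Run over the task logs of a whole job, a tally like this would show whether the failures concentrate on a handful of datanodes, as described here. It does not count the SocketTimeoutException lines, which name the remote address in a different format.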

      With several such bad datanodes in a cluster, the probability of a job failing rises substantially.
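
A rough way to see why the job failure probability climbs is the sketch below. It rests on assumptions that are not from this report: every task attempt fails independently with the same probability, and a job fails as soon as one task exhausts its attempts. The function name, the per-attempt probabilities, and the limit of 4 attempts are illustrative only.

```python
def job_failure_probability(p_attempt, tasks, max_attempts):
    """Probability that at least one task fails all of its attempts.

    Assumption: each attempt fails independently with probability p_attempt.
    """
    # A task is lost only if every one of its attempts fails.
    p_task_fails = p_attempt ** max_attempts
    # The job survives only if no task is lost.
    return 1.0 - (1.0 - p_task_fails) ** tasks


# Illustrative per-attempt failure probabilities, not measured values.
for p in (0.01, 0.05, 0.125):
    print(p, job_failure_probability(p, tasks=20000, max_attempts=4))
```

Under these assumptions a small rise in the per-attempt failure rate moves a 20,000-task job from almost certain success to almost certain failure. Real failures are likely correlated, since a retried task may be handed the same bad datanode again, so the independence assumption probably understates the risk.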

          People

            Assignee: Unassigned
            Reporter: Christian Kunz (ckunz)
            Votes: 2
            Watchers: 9