  Hadoop HDFS / HDFS-4272

Problem in DFSInputStream read retry logic may cause early failure


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate

    Description

      Assume the following call sequence:

       
      readWithStrategy()
        -> blockSeekTo()
        -> readBuffer()
           -> reader.doRead()
           -> seekToNewSource(): adds currentNode to deadNodes, hoping to get a different datanode
              -> blockSeekTo()
                 -> chooseDataNode()
                    -> block missing, so it clears deadNodes and picks currentNode again
              seekToNewSource() returns false
           readBuffer() re-throws the exception and quits the loop
      readWithStrategy() gets the exception and may fail the read call before maxBlockAcquireFailures attempts have been tried.
      

      Some issues with this logic:
      1. The seekToNewSource() logic is broken because it may clear deadNodes in the middle of a seek.
      2. The variable "int retries=2" in readWithStrategy() seems to conflict with maxBlockAcquireFailures; should it be removed?
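
      A minimal, self-contained sketch of issue 1 (hypothetical names, not the actual DFSInputStream code): with a single-replica block, marking the current node dead and then asking for a new source makes the chooser see "no live nodes", wipe deadNodes, and hand back the same failed node.

      ```java
      import java.util.Arrays;
      import java.util.HashSet;
      import java.util.List;
      import java.util.Set;

      // Sketch only: simulates the deadNodes bookkeeping described above,
      // not the real HDFS client classes.
      public class DeadNodesSketch {
          static final Set<String> deadNodes = new HashSet<>();
          // single-replica block whose only copy has a checksum error
          static final List<String> replicas = Arrays.asList("dn1");

          static String chooseDataNode() {
              for (String dn : replicas) {
                  if (!deadNodes.contains(dn)) {
                      return dn;
                  }
              }
              // "block missing" path: the failure history is wiped mid-seek
              deadNodes.clear();
              return replicas.get(0); // same bad node again
          }

          static boolean seekToNewSource(String currentNode) {
              deadNodes.add(currentNode);
              String next = chooseDataNode();
              return !next.equals(currentNode); // false => no new source found
          }

          public static void main(String[] args) {
              boolean moved = seekToNewSource("dn1");
              System.out.println("seekToNewSource returned " + moved); // false
              System.out.println("deadNodes afterwards: " + deadNodes); // []
              // The checksum failure on dn1 is now forgotten, so the outer
              // retry loop gives up well before maxBlockAcquireFailures
              // distinct failures are ever counted.
          }
      }
      ```

      Because deadNodes comes back empty, every subsequent retry repeats the same choice, which matches the early failure seen in the log below.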

      I wrote a test to reproduce the scenario; here is part of the log:

       
      2012-12-05 22:55:15,135 WARN  hdfs.DFSClient (DFSInputStream.java:readBuffer(596)) - Found Checksum error for BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from 127.0.0.1:50099 at 0
      2012-12-05 22:55:15,136 INFO  DataNode.clienttrace (BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: /127.0.0.1:50105, bytes: 4128, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: DS-91625336-192.168.0.101-50099-1354719314603, blockid: BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, duration: 2925000
      2012-12-05 22:55:15,136 INFO  hdfs.DFSClient (DFSInputStream.java:chooseDataNode(741)) - Could not obtain BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
      2012-12-05 22:55:15,136 WARN  hdfs.DFSClient (DFSInputStream.java:chooseDataNode(756)) - DFS chooseDataNode: got # 1 IOException, will wait for 274.34891931868265 msec.
      2012-12-05 22:55:15,413 INFO  DataNode.clienttrace (BlockSender.java:sendBlock(672)) - src: /127.0.0.1:50099, dest: /127.0.0.1:50106, bytes: 4128, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-1488457569_1, offset: 0, srvID: DS-91625336-192.168.0.101-50099-1354719314603, blockid: BP-50712310-192.168.0.101-1354719313473:blk_-705068286766485620_1002, duration: 283000
      2012-12-05 22:55:15,414 INFO  hdfs.StateChange (FSNamesystem.java:reportBadBlocks(4761)) - *DIR* reportBadBlocks
      2012-12-05 22:55:15,415 INFO  BlockStateChange (CorruptReplicasMap.java:addToCorruptReplicasMap(66)) - BLOCK NameSystem.addToCorruptReplicasMap: blk_-705068286766485620 added as corrupt on 127.0.0.1:50099 by null because client machine reported it
      2012-12-05 22:55:15,416 INFO  hdfs.TestClientReportBadBlock (TestDFSInputStream.java:testDFSInputStreamReadRetryTime(94)) - catch IOExceptionorg.apache.hadoop.fs.ChecksumException: Checksum error: /testFile at 0 exp: 809972010 got: -1374622118
      2012-12-05 22:55:15,431 INFO  hdfs.MiniDFSCluster (MiniDFSCluster.java:shutdown(1411)) - Shutting down the Mini HDFS Cluster
      

          People

            Assignee: Binglin Chang (decster)
            Reporter: Binglin Chang (decster)
