Hadoop HDFS / HDFS-6804

Add test for race condition between transferring a block and appending to a block that causes "Unexpected checksum mismatch exception"

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.8.4
    • Component/s: datanode
    • Labels:
      None

      Description

      We found the following error in the log of the destination datanode:

      2014-07-22 01:49:51,338 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-2072804351-192.168.2.104-1406008383435:blk_1073741997_9248
      java.io.IOException: Terminating due to a checksum error.java.io.IOException: Unexpected checksum mismatch while writing BP-2072804351-192.168.2.104-1406008383435:blk_1073741997_9248 from /192.168.2.101:39495
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:536)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:703)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:575)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
              at java.lang.Thread.run(Thread.java:744)
      

      Meanwhile, the log on the source datanode says the block was transmitted:

      2014-07-22 01:49:50,805 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer: Transmitted BP-2072804351-192.168.2.104-1406008383435:blk_1073741997_9248 (numBytes=16188152) to /192.168.2.103:50010
      

      When the destination datanode hits the checksum mismatch, it reports a bad block to the NameNode, and the NameNode marks the replica on the source datanode as corrupt. In fact, the replica on the source datanode is valid, because it passes checksum verification.

      In short, the replica on the source datanode is wrongly marked as corrupt.
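
The failure mode described above can be illustrated with a small, self-contained Java sketch. It is not Hadoop code and does not reproduce the datanode's actual logic. It assumes one plausible interleaving: the transfer captures the block data at its old length, an append then extends the last partial chunk and rewrites its checksum, and the transfer sends that newer checksum. The class name `ChecksumRaceSketch` and the helpers `crc` and `verify` are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

/** Illustrative model of one partial chunk and its stored checksum; not Hadoop code. */
class ChecksumRaceSketch {

    /** CRC32 over the first len bytes of data. */
    static long crc(byte[] data, int len) {
        CRC32 c = new CRC32();
        c.update(data, 0, len);
        return c.getValue();
    }

    /** Receiver-side check: recompute the checksum of the received bytes and compare. */
    static boolean verify(byte[] received, long sentChecksum) {
        return crc(received, received.length) == sentChecksum;
    }

    public static void main(String[] args) {
        // Replica state before the append: 5 bytes in a partial chunk plus its checksum.
        byte[] chunk = Arrays.copyOf("hello".getBytes(StandardCharsets.US_ASCII), 11);
        int visibleLength = 5;
        long storedChecksum = crc(chunk, visibleLength);

        // Assumed step 1: the transfer captures the data at the old length.
        byte[] sentData = Arrays.copyOf(chunk, visibleLength);

        // Assumed step 2: an append extends the same chunk and rewrites its checksum.
        byte[] extra = " world".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(extra, 0, chunk, visibleLength, extra.length);
        visibleLength += extra.length;
        storedChecksum = crc(chunk, visibleLength);

        // Assumed step 3: the transfer reads the checksum, which now covers the new length.
        long sentChecksum = storedChecksum;

        // The receiver sees a mismatch even though the source replica is self-consistent.
        System.out.println("receiver accepts transfer: " + verify(sentData, sentChecksum));
        System.out.println("source replica is valid: "
                + verify(Arrays.copyOf(chunk, visibleLength), storedChecksum));
    }
}
```

In this model the receiver rejects the transfer while the source replica still verifies against its own stored checksum, which matches the symptom reported here: the mismatch comes from the interleaving, not from corruption on the source.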

        Attachments

        1. HDFS-6804-branch-2.8.patch
          6 kB
          Brahma Reddy Battula
        2. HDFS-6804-branch-2.8-002.patch
          6 kB
          Brahma Reddy Battula
        3. HDFS-6804-branch-2.8-003.patch
          6 kB
          Brahma Reddy Battula
        4. Testcase_append_transfer_block.patch
          5 kB
          Brahma Reddy Battula

              People

              • Assignee:
                Brahma Reddy Battula (brahmareddy)
              • Reporter:
                Gordon Wang (wangg23)
              • Votes:
                2
              • Watchers:
                27
