Hadoop HDFS / HDFS-15837

Incorrect bytes causing block corruption after update pipeline and recovery failure


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.8.5
    • Fix Version/s: None
    • Component/s: datanode, hdfs
    • Labels: None

    Description

      We are seeing cases of HDFS blocks being marked as bad after the initial block receive fails during update pipeline, and the subsequent HDFS recovery for the block fails as well. Here is the life cycle of a block, blk_1342440165_272630578, that was ultimately marked as corrupt:

      1. The block creation starts at the name node as part of the update pipeline process:

      2021-01-25 03:41:17,335 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 61 on 8020): updatePipeline(blk_1342440165_272500939 => blk_1342440165_272630578) success

      2. The block receiver on the data node fails with a socket timeout exception, and so do the retries:

      2021-01-25 03:42:22,525 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]
      java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.21.226.26:56294 remote=/172.21.246.239:50010]
          at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
          at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
          at java.io.FilterInputStream.read(FilterInputStream.java:83)
          at java.io.FilterInputStream.read(FilterInputStream.java:83)
          at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:400)
          at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1305)
          at java.lang.Thread.run(Thread.java:748)
      
      2021-01-25 03:42:22,526 WARN org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): IOException in BlockReceiver.run(): 
      java.io.IOException: Connection reset by peer
          at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
          at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
          at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
          at sun.nio.ch.IOUtil.write(IOUtil.java:65)
          at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
          at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
          at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
          at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
          at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
          at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
          at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
          at java.io.DataOutputStream.flush(DataOutputStream.java:123)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1552)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1489)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1402)
          at java.lang.Thread.run(Thread.java:748)
      
      2021-01-25 03:42:22,526 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]
      java.io.IOException: Connection reset by peer
          at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
          at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
          at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
          at sun.nio.ch.IOUtil.write(IOUtil.java:65)
          at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
          at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
          at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
          at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
          at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
          at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
          at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
          at java.io.DataOutputStream.flush(DataOutputStream.java:123)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1552)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1489)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1402)
          at java.lang.Thread.run(Thread.java:748)

      3. The data node terminates receiving the block and initiates recovery:

      2021-01-25 03:42:22,526 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010]): PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[172.21.246.239:50010] terminating
      
      2021-01-25 03:42:22,529 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:60430 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): Receiving BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578 src: /172.21.240.31:60430 dest: /172.21.226.26:50010
      
      2021-01-25 03:42:22,529 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:60430 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): Recover RBW replica BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578

      4. The recovery on the datanode for blk_1342440165_272630578 fails with:

      2021-01-25 03:42:34,223 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:45918 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272500939]): Exception for BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578
      java.nio.channels.ClosedByInterruptException
          at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
          at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
          at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.adjustCrcChannelPosition(FsDatasetImpl.java:1716)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.adjustCrcFilePosition(BlockReceiver.java:1052)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:733)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:923)
          at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:854)
          at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
          at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
          at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
          at java.lang.Thread.run(Thread.java:748)
      
      2021-01-25 03:42:34,223 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:45918 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272500939]): opWriteBlock BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578 received exception java.nio.channels.ClosedByInterruptException
      2021-01-25 03:42:34,223 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:45918 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272500939]): ip-172-21-226-26.ec2.internal:50010:DataXceiver error processing WRITE_BLOCK operation  src: /172.21.240.31:45918 dst: /172.21.226.26:50010
      java.nio.channels.ClosedByInterruptException
          at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
          at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269)
          at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.adjustCrcChannelPosition(FsDatasetImpl.java:1716)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.adjustCrcFilePosition(BlockReceiver.java:1052)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:733)
          at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:923)
          at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:854)
          at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
          at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
          at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
          at java.lang.Thread.run(Thread.java:748)

       

      5. This is followed by the data node resetting bytesOnDisk to 3072, the correct length, which is expected behavior based on https://issues.apache.org/jira/browse/hdfs-11472 (see the sketch after the log below):

      2021-01-25 03:42:40,007 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:60430 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): At ip-172-21-226-26.ec2.internal:50010, Recovering ReplicaBeingWritten, blk_1342440165_272630578, RBW
        getNumBytes()     = 3072
        getBytesOnDisk()  = 2678
        getVisibleLength()= 3072
        getVolume()       = /mnt/hdfs/current
        getBlockFile()    = /mnt/hdfs/current/BP-908477295-172.21.224.178-1606768078949/current/rbw/blk_1342440165
        bytesAcked=3072
        bytesOnDisk=2678
      2021-01-25 03:42:40,011 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client DFSClient_attempt_16089610225137_1287341_r_000139_0_392781198_30 at /172.21.244.117:34354 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342762434_272648220]): Receiving BP-908477295-172.21.224.178-1606768078949:blk_1342762434_272648220 src: /172.21.244.117:34354 dest: /172.21.226.26:50010
      
      2021-01-25 03:42:40,013 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl (DataXceiver for client DFSClient_NONMAPREDUCE_390070548_42 at /172.21.240.31:60430 [Receiving block BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272630578]): Resetting bytesOnDisk to match blockDataLength (=3072) for replica ReplicaBeingWritten, blk_1342440165_272630578, RBW
        getNumBytes()     = 3072
        getBytesOnDisk()  = 2678
        getVisibleLength()= 3072
        getVolume()       = /mnt/hdfs/current
        getBlockFile()    = /mnt/hdfs/current/BP-908477295-172.21.224.178-1606768078949/current/rbw/blk_1342440165
        bytesAcked=3072
        bytesOnDisk=2678
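
      For context, here is a minimal sketch of what this reconciliation amounts to. It is not the actual FsDatasetImpl code; the class and method names below (RbwReplicaView, reconcileBytesOnDisk) are illustrative only. The point is that during RBW recovery the length of the block data file on disk is treated as authoritative, and the stale in-memory bytesOnDisk counter is brought back in line with it:

      // Illustrative sketch only -- not the real org.apache.hadoop.hdfs code.
      // Mirrors the HDFS-11472 behavior seen above: on RBW recovery, the stale
      // in-memory bytesOnDisk counter (2678 here) is reset to the length of the
      // block data file on disk (3072 here), which matches bytesAcked.
      import java.io.File;

      final class RbwReplicaView {                 // hypothetical name
          long bytesAcked;                         // bytes acknowledged to the client
          long bytesOnDisk;                        // in-memory view of persisted bytes
          final File blockFile;                    // e.g. .../rbw/blk_1342440165

          RbwReplicaView(long bytesAcked, long bytesOnDisk, File blockFile) {
              this.bytesAcked = bytesAcked;
              this.bytesOnDisk = bytesOnDisk;
              this.blockFile = blockFile;
          }

          // During recovery, trust the on-disk file length over the stale counter.
          void reconcileBytesOnDisk() {
              long blockDataLength = blockFile.length();
              if (bytesOnDisk != blockDataLength) {
                  // Corresponds to "Resetting bytesOnDisk to match blockDataLength (=3072)".
                  bytesOnDisk = blockDataLength;
              }
          }

          public static void main(String[] args) {
              // Values taken from the replica dump above; the block file exists only on the DN host.
              RbwReplicaView r = new RbwReplicaView(3072L, 2678L,
                  new File("/mnt/hdfs/current/BP-908477295-172.21.224.178-1606768078949/current/rbw/blk_1342440165"));
              r.reconcileBytesOnDisk();
              System.out.println("bytesOnDisk after reconcile = " + r.bytesOnDisk);
          }
      }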

      6. Since the recovery failed, the Name Node restarts updatePipeline for the block by bumping its generation stamp:

      2021-01-25 03:42:40,034 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 15 on 8020): updatePipeline(blk_1342440165_272630578, newGS=272645069, newLength=3072, newNodes=[172.21.226.26:50010], client=DFSClient_NONMAPREDUCE_390070548_42)
      
      2021-01-25 03:42:40,034 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem (IPC Server handler 15 on 8020): updatePipeline(blk_1342440165_272630578 => blk_1342440165_272645069) success

      Notice how the correct newLength, i.e. 3072, is sent by the Name Node in the above log.
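
      To make the generation-stamp bump concrete, the following is a purely illustrative sketch (our own type, not the actual ClientProtocol/ExtendedBlock API): pipeline recovery keeps the block ID, replaces the generation stamp, and reports the length the client has had acknowledged so far.

      // Purely illustrative -- not the NameNode RPC types. It only restates what the
      // updatePipeline log lines in step 6 encode.
      final class PipelineUpdate {                 // hypothetical name
          final long blockId;                      // unchanged across recovery: 1342440165
          final long oldGenStamp;                  // 272630578
          final long newGenStamp;                  // 272645069, bumped by the recovery
          final long newLength;                    // bytes acked by the client: 3072

          PipelineUpdate(long blockId, long oldGenStamp, long newGenStamp, long newLength) {
              this.blockId = blockId;
              this.oldGenStamp = oldGenStamp;
              this.newGenStamp = newGenStamp;
              this.newLength = newLength;
          }

          public static void main(String[] args) {
              // Values taken from the updatePipeline log lines above.
              PipelineUpdate u = new PipelineUpdate(1342440165L, 272630578L, 272645069L, 3072L);
              System.out.println("blk_" + u.blockId + "_" + u.oldGenStamp
                  + " => blk_" + u.blockId + "_" + u.newGenStamp + ", newLength=" + u.newLength);
          }
      }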

      7. The data node then receives the new generation of this block correctly:

      2021-01-25 03:46:40,066 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069, type=LAST_IN_PIPELINE): src: /172.21.240.31:60430, dest: /172.21.226.26:50010, bytes: 3554, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_390070548_42, offset: 0, srvID: 1a2ff556-73f5-4e10-b018-a0c2105b242e, blockid: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069, duration: 240052941459
      
      2021-01-25 03:46:40,066 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069, type=LAST_IN_PIPELINE): PacketResponder: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069, type=LAST_IN_PIPELINE terminating

      Notice, however, that bytes: 3554 here, which is different from what was sent by the Name Node, i.e. 3072.

      8. The data node then successfully replicates this block downstream to another node:

      2021-01-25 03:46:43,093 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (BP-908477295-172.21.224.178-1606768078949 heartbeating to /172.21.238.225:8020): DatanodeRegistration(172.21.226.26:50010, datanodeUuid=1a2ff556-73f5-4e10-b018-a0c2105b242e, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-57;cid=CID-7608840c-3634-4377-9ba0-1c710e6d08d5;nsid=2011779031;c=1606768078949) Starting thread to transfer BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069 to 172.21.234.181:50010 
      
      2021-01-25 03:46:43,094 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer@5cb29d58): DataTransfer, at ip-172-21-226-26.ec2.internal:50010: Transmitted BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069 (numBytes=3554) to /172.21.234.181:5001

      9. However, immediately after the successful transmission and replication, the name node declares the block corrupt:

      2021-01-25 03:46:43,106 INFO org.apache.hadoop.hdfs.StateChange (IPC Server handler 46 on 8020): *DIR* reportBadBlocks for block: BP-908477295-172.21.224.178-1606768078949:blk_1342440165_272645069 on datanode: 172.21.226.26:50010

       

      Any suggestions on what could cause this difference between the bytes communicated by the name node (3072) and what is written on the data node (3554)?
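
      For anyone cross-checking the two views, below is a minimal diagnostic sketch (our own code, not part of HDFS). It compares the file length the NameNode reports through the standard FileSystem API against the size of the replica's block file on the DataNode's local disk. It assumes the file in question consists of this single block and that the local check runs on the DataNode host; the block-file path is the one from the replica dump above and may move out of rbw/ once the block is finalized.

      // Minimal diagnostic sketch: NameNode's recorded file length vs. the size of the
      // block data file on the DataNode's local disk. Assumes a single-block file and
      // that this runs on the DataNode host; adjust the paths for your cluster.
      import java.io.File;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class BlockLengthCheck {
          public static void main(String[] args) throws Exception {
              // NameNode view: length recorded in the namespace for the HDFS file.
              Configuration conf = new Configuration();
              FileSystem fs = FileSystem.get(conf);
              Path hdfsFile = new Path(args[0]);       // HDFS path that owns blk_1342440165
              long nnLength = fs.getFileStatus(hdfsFile).getLen();

              // DataNode view: size of the replica's block data file on local disk.
              File blockFile = new File("/mnt/hdfs/current/"
                  + "BP-908477295-172.21.224.178-1606768078949/current/rbw/blk_1342440165");
              long dnLength = blockFile.length();

              System.out.println("NameNode file length : " + nnLength);  // compare with newLength=3072 in the NN log
              System.out.println("On-disk block length : " + dnLength);  // compare with bytes=3554 in the DN log
              fs.close();
          }
      }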

       

          People

            Assignee: Unassigned
            Reporter: Udit Mehrotra (uditme)
            Votes: 0
            Watchers: 2
