Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-10927

Lease Recovery: File not getting closed on HDFS when block write operation fails

VotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.7.1
    • Fix Version/s: None
    • Component/s: fs
    • Labels:
      None

      Description

      HDFS was unable to close a file when block write operation failed because of too high disk usage.

      Scenario:

      HBase was writing WAL logs on HDFS and the disk usage was too high at that time. While writing these WAL logs, one of the blocks writes operation failed with the following exception:

      2016-09-13 10:00:49,978 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1160899
      java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.194.144:50010 remote=/192.168.192.162:43105]
      at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
      at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
      at java.io.BufferedInputStream.fill(Unknown Source)
      at java.io.BufferedInputStream.read1(Unknown Source)
      at java.io.BufferedInputStream.read(Unknown Source)
      at java.io.DataInputStream.read(Unknown Source)
      at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
      at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
      at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
      at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
      at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
      at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:807)
      at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
      at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
      at java.lang.Thread.run(Unknown Source)

      After this exception, HBase tried to close/rollover the WAL file but that call also failed and WAL file couldn't be closed. After this HBase closed the region server

      After some time, Lease Recovery got triggered for this file and following exceptions starts occurring:

      2016-09-13 11:51:11,743 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to obtain replica info for block (=BP-337226066-192.168.193.217-1468912147102:blk_1074859607_1161187) from datanode (=DatanodeInfoWithStorage[192.168.192.162:50010,null,null])
      java.io.IOException: THIS IS NOT SUPPOSED TO HAPPEN: getBytesOnDisk() < getVisibleLength(), rip=ReplicaBeingWritten, blk_1074859607_1161187, RBW
      getNumBytes() = 45524696
      getBytesOnDisk() = 45483527
      getVisibleLength()= 45511557
      getVolume() = /opt/reflex/data/yarn/datanode/current
      getBlockFile() = /opt/reflex/data/yarn/datanode/current/BP-337226066-192.168.193.217-1468912147102/current/rbw/blk_1074859607
      bytesAcked=45511557
      bytesOnDisk=45483527
      at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2278)
      at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:2254)
      at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:2542)
      at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolServerSideTranslatorPB.initReplicaRecovery(InterDatanodeProtocolServerSideTranslatorPB.java:55)
      at org.apache.hadoop.hdfs.protocol.proto.InterDatanodeProtocolProtos$InterDatanodeProtocolService$2.callBlockingMethod(InterDatanodeProtocolProtos.java:3105)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Unknown Source)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)

      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
      at java.lang.reflect.Constructor.newInstance(Unknown Source)
      at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
      at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
      at org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:2555)
      at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2625)
      at org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:243)
      at org.apache.hadoop.hdfs.server.datanode.DataNode$5.run(DataNode.java:2527)
      at java.lang.Thread.run(Unknown Source)

      Expected Behaviour: Under all conditions lease recovery should have been done and file should have been closed.

      Impact: Since the file couldn't be closed, HBase went into an incosistent state as it wasn't able to run through the WAL file after the region server restart.

        Attachments

        Issue Links

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              ngoswami Nitin Goswami

              Dates

              • Created:
                Updated:
                Resolved:

                Issue deployment