Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 2.7.0
- Labels: None
- Hadoop Flags: Reviewed
Description
One failed volume's shutdown took 3 days to complete.
Below are the relevant datanode logs from shutting down a volume (due to a disk failure):
2016-02-21 10:12:55,333 [Thread-49277] WARN impl.FsDatasetImpl: Removing failed volume volumeA/current:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: volumeA/current/BP-1788428031-nnIp-1351700107344/current/finalized
    at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:194)
    at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:308)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.checkDirs(FsVolumeImpl.java:786)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:242)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
    at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
    at java.lang.Thread.run(Thread.java:745)
2016-02-21 10:12:55,334 [Thread-49277] INFO datanode.BlockScanner: Removing scanner for volume volumeA (StorageID DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23)
2016-02-21 10:12:55,334 [VolumeScannerThread(volumeA)] INFO datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23) exiting.
2016-02-21 10:12:55,335 [VolumeScannerThread(volumeA)] WARN datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23): error saving org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@4169ad8b.
java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/scanner.cursor.tmp (Read-only file system)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.save(FsVolumeImpl.java:669)
    at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.saveBlockIterator(VolumeScanner.java:314)
    at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)
2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Failed to write dfsUsed to volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed
java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed (Read-only file system)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.saveDfsUsed(BlockPoolSlice.java:247)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.shutdown(BlockPoolSlice.java:698)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.shutdown(FsVolumeImpl.java:815)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.removeVolume(FsVolumeList.java:328)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:250)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
    at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
    at java.lang.Thread.run(Thread.java:745)
2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removed volume: volumeA/current
2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Completed checkDirs. Found 1 failure volumes.
2016-02-24 16:05:53,286 [Thread-49277] INFO datanode.DataNode: Deactivating volumes (clear failure=false): volumeA
2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removing volumeA from FsDataset.
2016-02-24 16:05:53,342 [Thread-49277] INFO common.Storage: Removing block level storage: volumeA/current/BP-1788428031-nnIp-1351700107344
2016-02-24 16:05:53,345 [Thread-49277] WARN datanode.DataNode: DataNode.handleDiskError: Keep Running: true
The datanode waits for the volume's reference count to reach zero before shutting the volume down:
while (this.reference.getReferenceCount() > 0) {
  if (FsDatasetImpl.LOG.isDebugEnabled()) {
    FsDatasetImpl.LOG.debug(String.format(
        "The reference count for %s is %d, wait to be 0.",
        this, reference.getReferenceCount()));
  }
  try {
    Thread.sleep(SLEEP_MILLIS);
  } catch (InterruptedException e) {
    throw new IOException(e);
  }
}
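For context, here is a minimal sketch of the reference-counting contract that wait loop depends on. The class and method names below are illustrative, not the actual FsVolumeImpl code: every active user of the volume holds a closeable reference, and removal cannot proceed until each holder closes it.

import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the reference-counting contract, not the real
// FsVolumeImpl code: removal blocks until every holder releases.
class VolumeReference implements AutoCloseable {
  private final AtomicInteger refCount;

  VolumeReference(AtomicInteger refCount) {
    this.refCount = refCount;
    refCount.incrementAndGet(); // taken when a writer starts using the volume
  }

  @Override
  public void close() {
    refCount.decrementAndGet(); // released only when the holder finishes
  }
}

class Volume {
  final AtomicInteger refCount = new AtomicInteger();

  VolumeReference obtainReference() {
    return new VolumeReference(refCount);
  }

  // Mirrors the wait loop above: if any holder never closes its
  // reference, this loop spins until it does (here, for three days).
  void waitForZeroReferences() throws InterruptedException {
    while (refCount.get() > 0) {
      Thread.sleep(100);
    }
  }
}

Under this contract, the three-day stall simply means some holder kept its reference open the whole time, which is what the rest of the logs point to.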
Just before the datanode logged the following line,
2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
I saw the following stack trace:
2016-02-24 16:05:53,017 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Exception for BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
    at java.lang.Thread.run(Thread.java:745)
2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: opWriteBlock BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736 received exception java.io.IOException: Premature EOF from inputStream
2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] ERROR datanode.DataNode: thisDNName:1004:DataXceiver error processing WRITE_BLOCK operation src: /upStreamDNIp:55710 dst: /thisDNIp:1004
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
    at java.lang.Thread.run(Thread.java:745)
Tracking the block blk_7059462432_1109832821906, I see that it was created at 2016-02-17 15:06:28,256:
2016-02-17 15:06:28,928 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55590 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Receiving BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906 src: /upStreamDNIp:55590 dest: /thisDNIp:1004
The job that created this file had been running for more than 7 days, and the client eventually failed to renew its delegation token, so the lease manager failed to renew the lease for this file.
Once commitBlockSynchronization kicked in, it tried to recover the block, and the DataXceiver thread eventually died, decrementing the volume's reference count.
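Continuing the illustrative sketch above (again, not the actual DataXceiver code): a writer thread that blocks indefinitely while holding a volume reference stalls removal for as long as it lives, and only an interrupt, here triggered by block recovery, lets its cleanup run and release the reference.

// Illustrative only, reusing Volume/VolumeReference from the sketch above:
// the writer holds a reference while blocked on a stalled pipeline, the way
// the DataXceiver did, and releases it only when recovery interrupts it.
class StalledWriter implements Runnable {
  private final Volume volume;

  StalledWriter(Volume volume) {
    this.volume = volume;
  }

  @Override
  public void run() {
    try (VolumeReference ref = volume.obtainReference()) {
      Thread.sleep(Long.MAX_VALUE); // blocked for days on a dead client
    } catch (InterruptedException e) {
      // Interrupted by recovery; try-with-resources closes the reference,
      // the count drops to zero, and the volume shutdown finally completes.
    }
  }
}

This matches the timeline in the logs: the wait loop entered on 2016-02-21, and the volume was only removed on 2016-02-24, seconds after the DataXceiver thread died.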
Attachments
Issue Links
- is broken by
  - HDFS-10260 TestFsDatasetImpl#testCleanShutdownOfVolume often fails (Resolved)
- is related to
  - HDFS-10605 Can not synchronized call method of object and Mockito.spy(object), So UT:testRemoveVolumeBeingWritten passed but maybe deadlock online (Patch Available)