Uploaded image for project: 'Hadoop HDFS'
  1. Hadoop HDFS
  2. HDFS-9874

Long living DataXceiver threads cause volume shutdown to block.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 2.7.0
    • 2.8.0, 2.7.3, 3.0.0-alpha1
    • datanode
    • None
    • Reviewed

    Description

      One of the failed volume shutdown took 3 days to complete.
      Below are the relevant datanode logs while shutting down a volume (due to disk failure)

      2016-02-21 10:12:55,333 [Thread-49277] WARN impl.FsDatasetImpl: Removing failed volume volumeA/current: 
      org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: volumeA/current/BP-1788428031-nnIp-1351700107344/current/finalized
              at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:194)
              at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
              at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:308)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.checkDirs(FsVolumeImpl.java:786)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:242)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
              at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
              at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
              at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
              at java.lang.Thread.run(Thread.java:745)
      2016-02-21 10:12:55,334 [Thread-49277] INFO datanode.BlockScanner: Removing scanner for volume volumeA (StorageID DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23)
      2016-02-21 10:12:55,334 [VolumeScannerThread(volumeA)] INFO datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23) exiting.
      2016-02-21 10:12:55,335 [VolumeScannerThread(volumeA)] WARN datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23): error saving org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@4169ad8b.
      java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/scanner.cursor.tmp (Read-only file system)
              at java.io.FileOutputStream.open(Native Method)
              at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.save(FsVolumeImpl.java:669)
              at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.saveBlockIterator(VolumeScanner.java:314)
              at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)
      
      2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
      2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Failed to write dfsUsed to volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed
      java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed (Read-only file system)
      		at java.io.FileOutputStream.open(Native Method)
      		at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
      		at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
      		at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.saveDfsUsed(BlockPoolSlice.java:247)
      		at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.shutdown(BlockPoolSlice.java:698)
      		at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.shutdown(FsVolumeImpl.java:815)
      		at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.removeVolume(FsVolumeList.java:328)
      		at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:250)
      		at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
      		at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
      		at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
      		at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
      		at java.lang.Thread.run(Thread.java:745)
      
      2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removed volume: volumeA/current
      2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Completed checkDirs. Found 1 failure volumes.
      2016-02-24 16:05:53,286 [Thread-49277] INFO datanode.DataNode: Deactivating volumes (clear failure=false): volumeA
      2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removing volumeA from FsDataset.
      2016-02-24 16:05:53,342 [Thread-49277] INFO common.Storage: Removing block level storage: volumeA/current/BP-1788428031-nnIp-1351700107344
      2016-02-24 16:05:53,345 [Thread-49277] WARN datanode.DataNode: DataNode.handleDiskError: Keep Running: true
      

      The datanode waits for the reference count to become zero before shutting down the volume.

      FsVolumeImpl.java
      while (this.reference.getReferenceCount() > 0) {
           if (FsDatasetImpl.LOG.isDebugEnabled()) {
             FsDatasetImpl.LOG.debug(String.format(
                 "The reference count for %s is %d, wait to be 0.",
                 this, reference.getReferenceCount()));
           }
           try {
             Thread.sleep(SLEEP_MILLIS);
           } catch (InterruptedException e) {
             throw new IOException(e);
           }
         }
      

      Just before datanode logged the following line,

       
      2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
      

      I saw the following stack trace

      2016-02-24 16:05:53,017 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Exception for BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736
      java.io.IOException: Premature EOF from inputStream
              at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
              at java.lang.Thread.run(Thread.java:745)
      2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
      2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
      2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: opWriteBlock BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736 received exception java.io.IOException: Prematur
      e EOF from inputStream
      2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] ERROR datanode.DataNode: thisDNName:1004:DataXceiver error processing WRITE_BLOCK operation  src: /upStreamDNIp:55710 dst: /thisDNIp
      :1004
      java.io.IOException: Premature EOF from inputStream
              at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
              at java.lang.Thread.run(Thread.java:745)
      
      

      On tracking the block blk_7059462432_1109832821906, I see that the block was created on 2016-02-17 15:06:28,256

      2016-02-17 15:06:28,928 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55590 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Receiving BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906 src: /upStreamDNIp:55590 dest: /thisDNIp:1004
      

      The job which created this file was running for more than 7 days and the client eventually failed to renew the delegation token so the lease manager failed to renew lease for this file.
      Once commitBlockSynchronization kicked in, it tried to recover the block and eventually the DataXceiver thread died decrementing the reference count.

      Attachments

        1. HDFS-9874-trunk.patch
          6 kB
          Rushabh Shah
        2. HDFS-9874-trunk-1.patch
          7 kB
          Rushabh Shah
        3. HDFS-9874-trunk-2.patch
          7 kB
          Rushabh Shah

        Issue Links

          Activity

            People

              shahrs87 Rushabh Shah
              shahrs87 Rushabh Shah
              Votes:
              0 Vote for this issue
              Watchers:
              16 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: