Hadoop HDFS / HDFS-9874

Long living DataXceiver threads cause volume shutdown to block.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.7.0
    • Fix Version/s: 2.8.0, 2.7.3, 3.0.0-alpha1
    • Component/s: datanode
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

One failed volume's shutdown took 3 days to complete.
Below are the relevant datanode logs from shutting down a volume (due to a disk failure):

      2016-02-21 10:12:55,333 [Thread-49277] WARN impl.FsDatasetImpl: Removing failed volume volumeA/current: 
      org.apache.hadoop.util.DiskChecker$DiskErrorException: Directory is not writable: volumeA/current/BP-1788428031-nnIp-1351700107344/current/finalized
              at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:194)
              at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
              at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.checkDirs(BlockPoolSlice.java:308)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.checkDirs(FsVolumeImpl.java:786)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:242)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
              at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
              at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
              at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
              at java.lang.Thread.run(Thread.java:745)
      2016-02-21 10:12:55,334 [Thread-49277] INFO datanode.BlockScanner: Removing scanner for volume volumeA (StorageID DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23)
      2016-02-21 10:12:55,334 [VolumeScannerThread(volumeA)] INFO datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23) exiting.
      2016-02-21 10:12:55,335 [VolumeScannerThread(volumeA)] WARN datanode.VolumeScanner: VolumeScanner(volumeA, DS-cd2ea223-bab3-4361-a567-5f3f27a5dd23): error saving org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl@4169ad8b.
      java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/scanner.cursor.tmp (Read-only file system)
              at java.io.FileOutputStream.open(Native Method)
              at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
              at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl$BlockIteratorImpl.save(FsVolumeImpl.java:669)
              at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.saveBlockIterator(VolumeScanner.java:314)
              at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)
      
      2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
      2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Failed to write dfsUsed to volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed
      java.io.FileNotFoundException: volumeA/current/BP-1788428031-nnIp-1351700107344/current/dfsUsed (Read-only file system)
      		at java.io.FileOutputStream.open(Native Method)
      		at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
      		at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
      		at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.saveDfsUsed(BlockPoolSlice.java:247)
      		at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.shutdown(BlockPoolSlice.java:698)
      		at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.shutdown(FsVolumeImpl.java:815)
      		at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.removeVolume(FsVolumeList.java:328)
      		at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.checkDirs(FsVolumeList.java:250)
      		at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.checkDataDir(FsDatasetImpl.java:2011)
      		at org.apache.hadoop.hdfs.server.datanode.DataNode.checkDiskError(DataNode.java:3145)
      		at org.apache.hadoop.hdfs.server.datanode.DataNode.access$800(DataNode.java:243)
      		at org.apache.hadoop.hdfs.server.datanode.DataNode$7.run(DataNode.java:3178)
      		at java.lang.Thread.run(Thread.java:745)
      
      2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removed volume: volumeA/current
      2016-02-24 16:05:53,286 [Thread-49277] WARN impl.FsDatasetImpl: Completed checkDirs. Found 1 failure volumes.
      2016-02-24 16:05:53,286 [Thread-49277] INFO datanode.DataNode: Deactivating volumes (clear failure=false): volumeA
      2016-02-24 16:05:53,286 [Thread-49277] INFO impl.FsDatasetImpl: Removing volumeA from FsDataset.
      2016-02-24 16:05:53,342 [Thread-49277] INFO common.Storage: Removing block level storage: volumeA/current/BP-1788428031-nnIp-1351700107344
      2016-02-24 16:05:53,345 [Thread-49277] WARN datanode.DataNode: DataNode.handleDiskError: Keep Running: true
      

      The datanode waits for the reference count to become zero before shutting down the volume.

      FsVolumeImpl.java
      while (this.reference.getReferenceCount() > 0) {
        if (FsDatasetImpl.LOG.isDebugEnabled()) {
          FsDatasetImpl.LOG.debug(String.format(
              "The reference count for %s is %d, wait to be 0.",
              this, reference.getReferenceCount()));
        }
        try {
          Thread.sleep(SLEEP_MILLIS);
        } catch (InterruptedException e) {
          throw new IOException(e);
        }
      }
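The failure mode above can be sketched in isolation. This is a minimal, hypothetical demo (not the actual FsVolumeImpl code; names like RefCountShutdownDemo are illustrative): the shutdown thread spins until the reference count reaches zero, with no timeout, so a single long-lived reference holder blocks it indefinitely.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountShutdownDemo {
    // One reference is held by a stand-in for a long-living DataXceiver.
    static final AtomicInteger refCount = new AtomicInteger(1);

    public static void main(String[] args) throws InterruptedException {
        Thread closer = new Thread(() -> {
            // Same shape as FsVolumeImpl's wait loop: spin until the count is 0.
            while (refCount.get() > 0) {
                try { Thread.sleep(10); } catch (InterruptedException e) { return; }
            }
            System.out.println("volume shut down");
        });
        closer.start();

        Thread.sleep(100); // shutdown is stuck for as long as the ref is held
        System.out.println("closer alive while ref held: " + closer.isAlive());

        refCount.decrementAndGet(); // the xceiver finally releases its reference
        closer.join(1000);
        System.out.println("closer alive after release: " + closer.isAlive());
    }
}
```

In the reported incident the "release" did not happen for 3 days, because the writer thread sat blocked on a socket read the whole time.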
      

Just before the datanode logged the following line,

      2016-02-24 16:05:53,285 [Thread-49277] WARN impl.FsDatasetImpl: Failed to delete old dfsUsed file in volumeA/current/BP-1788428031-nnIp-1351700107344/current
      

I saw the following stack trace:

      2016-02-24 16:05:53,017 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Exception for BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736
      java.io.IOException: Premature EOF from inputStream
              at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
              at java.lang.Thread.run(Thread.java:745)
      2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
      2016-02-24 16:05:53,018 [PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[]] INFO datanode.DataNode: PacketResponder: BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: opWriteBlock BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832909736 received exception java.io.IOException: Premature EOF from inputStream
2016-02-24 16:05:53,018 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55710 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] ERROR datanode.DataNode: thisDNName:1004:DataXceiver error processing WRITE_BLOCK operation  src: /upStreamDNIp:55710 dst: /thisDNIp:1004
      java.io.IOException: Premature EOF from inputStream
              at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
              at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:501)
              at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:895)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
              at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
              at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
              at java.lang.Thread.run(Thread.java:745)
      
      

Tracking the block blk_7059462432_1109832821906, I see that it was created at 2016-02-17 15:06:28,256:

      2016-02-17 15:06:28,928 [DataXceiver for client DFSClient_NONMAPREDUCE_1651663681_1 at /upStreamDNIp:55590 [Receiving block BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906]] INFO datanode.DataNode: Receiving BP-1788428031-nnIp-1351700107344:blk_7059462432_1109832821906 src: /upStreamDNIp:55590 dest: /thisDNIp:1004
      

The job that created this file had been running for more than 7 days, and the client eventually failed to renew its delegation token, so the lease manager failed to renew the lease for this file.
Once commitBlockSynchronization kicked in, it tried to recover the block, and the DataXceiver thread eventually died, decrementing the reference count.

      1. HDFS-9874-trunk.patch
        6 kB
        Rushabh S Shah
      2. HDFS-9874-trunk-1.patch
        7 kB
        Rushabh S Shah
      3. HDFS-9874-trunk-2.patch
        7 kB
        Rushabh S Shah

        Issue Links

          Activity

          shahrs87 Rushabh S Shah added a comment -

          This patch goes through the volume map and stops the writer thread if the replica object is an instance of ReplicaInPipeline.
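The approach described in this comment can be sketched as follows. This is a hedged, simplified illustration, not the HDFS-9874 patch itself (the Replica/ReplicaInPipeline types and stopAllWriters helper here are stand-ins for the real HDFS classes): walk the volume's replica map and interrupt the writer thread of every replica that is still being written.

```java
import java.util.ArrayList;
import java.util.List;

public class StopWritersSketch {
    interface Replica { }

    // A finalized replica has no writer thread attached.
    static class FinalizedReplica implements Replica { }

    // A replica still in the write pipeline holds a reference to its writer.
    static class ReplicaInPipeline implements Replica {
        final Thread writer;
        ReplicaInPipeline(Thread writer) { this.writer = writer; }
        void stopWriter() { writer.interrupt(); } // nudge the stuck DataXceiver
    }

    // Iterate the replica map; only in-flight writes need their writers stopped.
    static int stopAllWriters(List<Replica> volumeMap) {
        int stopped = 0;
        for (Replica r : volumeMap) {
            if (r instanceof ReplicaInPipeline) {
                ((ReplicaInPipeline) r).stopWriter();
                stopped++;
            }
        }
        return stopped;
    }

    public static void main(String[] args) throws InterruptedException {
        // A writer blocked for a long time, like the 7-day DataXceiver in the report.
        Thread writer = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException e) { /* exit */ }
        });
        writer.start();

        List<Replica> map = new ArrayList<>();
        map.add(new FinalizedReplica());
        map.add(new ReplicaInPipeline(writer));

        System.out.println("writers stopped: " + stopAllWriters(map));
        writer.join(1000);
        System.out.println("writer alive: " + writer.isAlive());
    }
}
```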

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 11s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 7m 35s trunk passed
          +1 compile 0m 44s trunk passed with JDK v1.8.0_72
          +1 compile 0m 45s trunk passed with JDK v1.7.0_95
          +1 checkstyle 0m 24s trunk passed
          +1 mvnsite 0m 55s trunk passed
          +1 mvneclipse 0m 14s trunk passed
          +1 findbugs 2m 11s trunk passed
          +1 javadoc 1m 13s trunk passed with JDK v1.8.0_72
          +1 javadoc 1m 53s trunk passed with JDK v1.7.0_95
          +1 mvninstall 0m 50s the patch passed
          +1 compile 0m 42s the patch passed with JDK v1.8.0_72
          +1 javac 0m 42s the patch passed
          +1 compile 0m 43s the patch passed with JDK v1.7.0_95
          +1 javac 0m 43s the patch passed
          +1 checkstyle 0m 20s the patch passed
          +1 mvnsite 0m 54s the patch passed
          +1 mvneclipse 0m 12s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 2m 25s the patch passed
          +1 javadoc 1m 10s the patch passed with JDK v1.8.0_72
          +1 javadoc 1m 56s the patch passed with JDK v1.7.0_95
          -1 unit 65m 57s hadoop-hdfs in the patch failed with JDK v1.8.0_72.
          -1 unit 57m 28s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
          +1 asflicense 0m 22s Patch does not generate ASF License warnings.
          151m 20s



          Reason Tests
          JDK v1.8.0_72 Failed junit tests hadoop.hdfs.server.namenode.TestNamenodeCapacityReport
            hadoop.hdfs.shortcircuit.TestShortCircuitCache
          JDK v1.7.0_95 Failed junit tests hadoop.hdfs.server.namenode.TestEditLog
            hadoop.hdfs.server.datanode.TestFsDatasetCache



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ca8df7
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12791200/HDFS-9874-trunk.patch
          JIRA Issue HDFS-9874
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux bf00281a78f7 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 0a9f00a
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_72 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/14705/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_72.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/14705/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/14705/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_72.txt https://builds.apache.org/job/PreCommit-HDFS-Build/14705/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/14705/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/14705/console
          Powered by Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          shahrs87 Rushabh S Shah added a comment -

          I ran all the failed tests on jdk7 (jdk1.7.0_71) and jdk8 (jdk1.8.0_45).
          None of the tests failed on my machine.

          daryn Daryn Sharp added a comment -

          The synchronization on FSDatasetImpl#stopAllDataxceiverThreads is a bit concerning. Stopping xceiver threads uses a default timeout of 1min. That's a long time for the DN to block if threads don't exit immediately.

The iteration of replicas might not be safe. The correct locking model isn't immediately clear, but ReplicaMap#replicas has the following comment, which other code doesn't appear to follow:

            /**
             * Get a collection of the replicas for given block pool
             * This method is <b>not synchronized</b>. It needs to be synchronized
             * externally using the mutex, both for getting the replicas
             * values from the map and iterating over it. Mutex can be accessed using
             * {@link #getMutext()} method.
             */

          Might need to consider forcibly decrementing the ref and interrupting with no timeout.

          For the test, I'd assert the volume actually has a non-zero ref count before trying to interrupt. Instead of triggering an async check and sleeping, which inevitably creates flaky race conditions, the disk check should be invoked non-async. Should verify that the client stream fails after the volume is failed.

          shahrs87 Rushabh S Shah added a comment -

          Cancelling the patch to address Daryn's comment.

          shahrs87 Rushabh S Shah added a comment -

          Thanks Daryn for the valuable comments.

          The synchronization on FSDatasetImpl#stopAllDataxceiverThreads is a bit concerning. Stopping xceiver threads uses a default timeout of 1min. That's a long time for the DN to block if threads don't exit immediately.

          Addressed the issue by interrupting the BlockReceiver thread.

          The iteration of replicas might not be safe. The correct locking model isn't immediately clear but ReplicaMap#replicas has the comment which other code doesn't appear to follow:

          Since all the calls to ReplicaMap#replicas are synchronized on the FsDatasetImpl class, I did the same here.

          For the test, I'd assert the volume actually has a non-zero ref count before trying to interrupt. Instead of triggering an async check and sleeping, which inevitably creates flaky race conditions, the disk check should be invoked non-async. Should verify that the client stream fails after the volume is failed.

          That's a good suggestion to write good test cases. Thanks a lot.
          Addressed all the comments in this section.
          Please review the revised patch.
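The locking convention under discussion can be sketched like this. A hedged illustration only (the mutex object here stands in for the FsDatasetImpl instance the real code synchronizes on; the map contents are made up): per the ReplicaMap javadoc, both fetching the replica collection and iterating it must happen under the same external lock, or a concurrent put can invalidate the iterator.

```java
import java.util.HashMap;
import java.util.Map;

public class ReplicaIterationSketch {
    // Stand-in for the FsDatasetImpl instance used as the external mutex.
    static final Object mutex = new Object();
    static final Map<Long, String> replicas = new HashMap<>();

    // Hold the lock for BOTH getting the collection and iterating over it,
    // as the ReplicaMap#replicas javadoc requires.
    static int countUnderLock() {
        synchronized (mutex) {
            int n = 0;
            for (Map.Entry<Long, String> e : replicas.entrySet()) {
                n++; // safe: no concurrent modification while the mutex is held
            }
            return n;
        }
    }

    public static void main(String[] args) {
        synchronized (mutex) {
            replicas.put(1L, "blk_1");
            replicas.put(2L, "blk_2");
        }
        System.out.println("replicas: " + countUnderLock());
    }
}
```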

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 12s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 9m 35s trunk passed
          +1 compile 0m 52s trunk passed with JDK v1.8.0_74
          +1 compile 0m 41s trunk passed with JDK v1.7.0_95
          +1 checkstyle 0m 21s trunk passed
          +1 mvnsite 0m 54s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 2m 2s trunk passed
          +1 javadoc 1m 26s trunk passed with JDK v1.8.0_74
          +1 javadoc 2m 13s trunk passed with JDK v1.7.0_95
          +1 mvninstall 0m 56s the patch passed
          +1 compile 0m 57s the patch passed with JDK v1.8.0_74
          +1 javac 0m 57s the patch passed
          +1 compile 1m 5s the patch passed with JDK v1.7.0_95
          +1 javac 1m 5s the patch passed
          -1 checkstyle 0m 30s hadoop-hdfs-project/hadoop-hdfs: patch generated 1 new + 145 unchanged - 0 fixed = 146 total (was 145)
          +1 mvnsite 1m 21s the patch passed
          +1 mvneclipse 0m 19s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 3m 12s the patch passed
          +1 javadoc 1m 31s the patch passed with JDK v1.8.0_74
          +1 javadoc 2m 25s the patch passed with JDK v1.7.0_95
          -1 unit 75m 15s hadoop-hdfs in the patch failed with JDK v1.8.0_74.
          -1 unit 70m 24s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
          +1 asflicense 0m 26s Patch does not generate ASF License warnings.
          180m 17s



          Reason Tests
          JDK v1.8.0_74 Failed junit tests hadoop.hdfs.server.datanode.TestTriggerBlockReport
            hadoop.hdfs.server.namenode.TestEditLog
            hadoop.hdfs.TestFileAppend
            hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA
            hadoop.hdfs.server.namenode.TestNamenodeCapacityReport
          JDK v1.7.0_95 Failed junit tests hadoop.hdfs.server.namenode.TestEditLog
            hadoop.hdfs.TestDFSUpgradeFromImage
            hadoop.metrics2.sink.TestRollingFileSystemSinkWithSecureHdfs



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ca8df7
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12792352/HDFS-9874-trunk-1.patch
          JIRA Issue HDFS-9874
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux ee85e8c1fa27 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 2e040d3
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_74 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/14764/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/14764/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_74.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/14764/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/14764/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_74.txt https://builds.apache.org/job/PreCommit-HDFS-Build/14764/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/14764/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/14764/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 12s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 9m 35s trunk passed +1 compile 0m 52s trunk passed with JDK v1.8.0_74 +1 compile 0m 41s trunk passed with JDK v1.7.0_95 +1 checkstyle 0m 21s trunk passed +1 mvnsite 0m 54s trunk passed +1 mvneclipse 0m 13s trunk passed +1 findbugs 2m 2s trunk passed +1 javadoc 1m 26s trunk passed with JDK v1.8.0_74 +1 javadoc 2m 13s trunk passed with JDK v1.7.0_95 +1 mvninstall 0m 56s the patch passed +1 compile 0m 57s the patch passed with JDK v1.8.0_74 +1 javac 0m 57s the patch passed +1 compile 1m 5s the patch passed with JDK v1.7.0_95 +1 javac 1m 5s the patch passed -1 checkstyle 0m 30s hadoop-hdfs-project/hadoop-hdfs: patch generated 1 new + 145 unchanged - 0 fixed = 146 total (was 145) +1 mvnsite 1m 21s the patch passed +1 mvneclipse 0m 19s the patch passed +1 whitespace 0m 0s Patch has no whitespace issues. +1 findbugs 3m 12s the patch passed +1 javadoc 1m 31s the patch passed with JDK v1.8.0_74 +1 javadoc 2m 25s the patch passed with JDK v1.7.0_95 -1 unit 75m 15s hadoop-hdfs in the patch failed with JDK v1.8.0_74. -1 unit 70m 24s hadoop-hdfs in the patch failed with JDK v1.7.0_95. +1 asflicense 0m 26s Patch does not generate ASF License warnings. 
180m 17s Reason Tests JDK v1.8.0_74 Failed junit tests hadoop.hdfs.server.datanode.TestTriggerBlockReport   hadoop.hdfs.server.namenode.TestEditLog   hadoop.hdfs.TestFileAppend   hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA   hadoop.hdfs.server.namenode.TestNamenodeCapacityReport JDK v1.7.0_95 Failed junit tests hadoop.hdfs.server.namenode.TestEditLog   hadoop.hdfs.TestDFSUpgradeFromImage   hadoop.metrics2.sink.TestRollingFileSystemSinkWithSecureHdfs Subsystem Report/Notes Docker Image:yetus/hadoop:0ca8df7 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12792352/HDFS-9874-trunk-1.patch JIRA Issue HDFS-9874 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux ee85e8c1fa27 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 2e040d3 Default Java 1.7.0_95 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_74 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-HDFS-Build/14764/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt unit https://builds.apache.org/job/PreCommit-HDFS-Build/14764/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_74.txt unit https://builds.apache.org/job/PreCommit-HDFS-Build/14764/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/14764/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_74.txt https://builds.apache.org/job/PreCommit-HDFS-Build/14764/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/14764/testReport/ modules C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs Console output https://builds.apache.org/job/PreCommit-HDFS-Build/14764/console Powered by Apache Yetus 0.2.0 http://yetus.apache.org This message was automatically generated.
          shahrs87 Rushabh S Shah added a comment -

          Cancelling the patch to address checkstyle warning.

          shahrs87 Rushabh S Shah added a comment -

          Fixed checkstyle warning.

          shahrs87 Rushabh S Shah added a comment -

          Ran all the failed tests on both jdk7 and jdk8.
          All of them passed.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 10s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 6m 33s trunk passed
          +1 compile 0m 40s trunk passed with JDK v1.8.0_74
          +1 compile 0m 41s trunk passed with JDK v1.7.0_95
          +1 checkstyle 0m 22s trunk passed
          +1 mvnsite 0m 51s trunk passed
          +1 mvneclipse 0m 13s trunk passed
          +1 findbugs 1m 54s trunk passed
          +1 javadoc 1m 6s trunk passed with JDK v1.8.0_74
          +1 javadoc 1m 46s trunk passed with JDK v1.7.0_95
          +1 mvninstall 0m 47s the patch passed
          +1 compile 0m 36s the patch passed with JDK v1.8.0_74
          +1 javac 0m 36s the patch passed
          +1 compile 0m 38s the patch passed with JDK v1.7.0_95
          +1 javac 0m 38s the patch passed
          +1 checkstyle 0m 20s the patch passed
          +1 mvnsite 0m 48s the patch passed
          +1 mvneclipse 0m 11s the patch passed
          -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 findbugs 2m 10s the patch passed
          +1 javadoc 1m 5s the patch passed with JDK v1.8.0_74
          +1 javadoc 1m 44s the patch passed with JDK v1.7.0_95
          +1 unit 54m 40s hadoop-hdfs in the patch passed with JDK v1.8.0_74.
          -1 unit 53m 23s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
          +1 asflicense 0m 20s Patch does not generate ASF License warnings.
          133m 4s



          Reason Tests
          JDK v1.7.0_95 Failed junit tests hadoop.hdfs.TestHFlush
            hadoop.hdfs.server.namenode.ha.TestDFSUpgradeWithHA



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:0ca8df7
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12792577/HDFS-9874-trunk-2.patch
          JIRA Issue HDFS-9874
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 761ed470b5bb 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 79961ec
          Default Java 1.7.0_95
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_74 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
          findbugs v3.0.0
          whitespace https://builds.apache.org/job/PreCommit-HDFS-Build/14783/artifact/patchprocess/whitespace-eol.txt
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/14783/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/14783/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
          JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/14783/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/14783/console
          Powered by Apache Yetus 0.2.0 http://yetus.apache.org

          This message was automatically generated.

          kihwal Kihwal Lee added a comment -

This patch will kick out writers even for "graceful" removals. But when a drive is removed, we probably don't want long-living writers to block the maintenance, so I think it is still acceptable.
+1, the patch looks good.
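The review point above — that removing a volume should kick out its long-living writers rather than wait on them — is the essence of the fix. Below is a minimal, self-contained sketch of that idea; all names here (Replica, removeVolume, etc.) are illustrative stand-ins, not the actual HDFS classes: each in-flight replica remembers its writer thread, and volume removal interrupts those writers so shutdown cannot block on a writer that would otherwise live for days.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

/**
 * Sketch only (hypothetical names): each in-flight replica tracks its
 * writer thread; removing a volume interrupts the writers on that volume
 * instead of waiting for them to finish on their own.
 */
public class VolumeWriterInterrupt {
  /** Stands in for a replica being written on some volume. */
  static final class Replica {
    final String volume;
    volatile Thread writer; // thread currently writing this replica
    Replica(String volume) { this.volume = volume; }
    void interruptWriter() {
      Thread t = writer;
      if (t != null && t.isAlive() && t != Thread.currentThread()) {
        t.interrupt();
      }
    }
  }

  static final Map<String, Replica> replicas = new ConcurrentHashMap<>();

  /** Stands in for the volume-removal path: kick out writers on this volume. */
  static void removeVolume(String volume) {
    for (Replica r : replicas.values()) {
      if (r.volume.equals(volume)) {
        r.interruptWriter();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Replica r = new Replica("volumeA");
    replicas.put("blk_1", r);
    CountDownLatch started = new CountDownLatch(1);
    final boolean[] kickedOut = {false};
    Thread writer = new Thread(() -> {
      r.writer = Thread.currentThread();
      started.countDown();
      try {
        Thread.sleep(60_000);    // a long-living writer
      } catch (InterruptedException e) {
        kickedOut[0] = true;     // writer was interrupted by the removal
      }
    });
    writer.start();
    started.await();
    removeVolume("volumeA");     // shutdown no longer blocks on the writer
    writer.join(5_000);
    System.out.println("writer kicked out: " + kickedOut[0]);
  }
}
```

Judging by the files touched in the commit (ReplicaInPipeline, FsVolumeImpl, FsDatasetImpl), the real fix threads this interruption through the replica-in-pipeline machinery so the volume's references can drain and shutdown proceeds.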

          kihwal Kihwal Lee added a comment -

          Committed to trunk through branch-2.7. Thanks for analyzing and fixing the issue, Rushabh S Shah.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #9474 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9474/)
HDFS-9874. Long living DataXceiver threads cause volume shutdown to block. (kihwal: rev 63c966a3fbeb675959fc4101e65de9f57aecd17d)

          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/TestFsDatasetImpl.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsVolumeImpl.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/FsDatasetImpl.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/ReplicaInPipeline.java
          shahrs87 Rushabh S Shah added a comment -

Thanks Kihwal Lee for reviewing and committing, and Daryn Sharp for the excellent reviews.

          jojochuang Wei-Chiu Chuang added a comment -

It seems this patch may have introduced a bug.

In a precommit job, the new test threw an NPE:
https://builds.apache.org/job/PreCommit-HDFS-Build/14881/testReport/org.apache.hadoop.hdfs.server.datanode.fsdataset.impl/TestFsDatasetImpl/testCleanShutdownOfVolume/

          Exception in thread "DataNode: [[[DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/3/dfs/data/data1/, [DISK]file:/testptch/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/3/dfs/data/data2/]] heartbeating to localhost/127.0.0.1:39740" java.lang.NullPointerException
          at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1714)
          at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.shutdownBlockPool(FsDatasetImpl.java:2591)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdownBlockPool(DataNode.java:1479)
          at org.apache.hadoop.hdfs.server.datanode.BPOfferService.shutdownActor(BPOfferService.java:411)
          at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.cleanUp(BPServiceActor.java:494)
          at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:749)
          at java.lang.Thread.run(Thread.java:745)

The precommit record shows the test has failed in three consecutive runs.

          shahrs87 Rushabh S Shah added a comment -

          Wei-Chiu Chuang: Thanks for reporting. Taking a look now.

          shahrs87 Rushabh S Shah added a comment -

          The NPE is expected.

          at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getBlockReports(FsDatasetImpl.java:1714)
          at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.shutdownBlockPool(FsDatasetImpl.java:2591)
          at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdownBlockPool(DataNode.java:1479)
          at org.apache.hadoop.hdfs.server.datanode.BPOfferService.shutdownActor(BPOfferService.java:411)

This is called while shutting down the cluster.
The NPE is expected because the test triggers only part of the checkDiskError logic:

          DataNode.java
private void checkDiskError() {
  Set<File> unhealthyDataDirs = data.checkDataDir();
  if (unhealthyDataDirs != null && !unhealthyDataDirs.isEmpty()) {
    try {
      // Remove all unhealthy volumes from DataNode.
      removeVolumes(unhealthyDataDirs, false);
    } catch (IOException e) {
      LOG.warn("Error occurred when removing unhealthy storage dirs: "
          + e.getMessage(), e);
    }
    StringBuilder sb = new StringBuilder("DataNode failed volumes:");
    for (File dataDir : unhealthyDataDirs) {
      sb.append(dataDir.getAbsolutePath() + ";");
    }
    handleDiskError(sb.toString());
  }
}
          

The test calls only the first line of the method above, because I don't want the test to wait for DataNode#checkDiskErrorInterval (which is 5 seconds by default).
As a result, removeVolumes(unhealthyDataDirs, false) is never executed, hence the NPE.
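The failure mode above can be modeled in a few lines (hypothetical names; this is not the HDFS code): detecting the bad volume and removing it are two separate steps, and performing only the first leaves stale state behind that a later block report dereferences.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of why the NPE is expected: the test runs only the
 * detection step (checkDataDir) and never the removal step
 * (removeVolumes), so a later block report still looks up state for
 * the failed volume and finds null.
 */
public class SkippedRemovalNpe {
  static Map<String, String> storageState = new HashMap<>();

  // Step 1 only: the volume is marked failed and its state nulled out,
  // but its storage entry is never removed (removeVolumes is skipped).
  static void checkDataDir() {
    storageState.put("DS-volumeA", null);
  }

  static String getBlockReport(String storageId) {
    // Mirrors the shutdown path's assumption that every registered
    // storage still has live state.
    return storageState.get(storageId).toUpperCase(); // NPE for failed volume
  }

  public static void main(String[] args) {
    storageState.put("DS-volumeA", "finalized:12 blocks");
    checkDataDir();
    try {
      getBlockReport("DS-volumeA");
      System.out.println("no NPE");
    } catch (NullPointerException e) {
      System.out.println("NPE as expected");
    }
  }
}
```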

I am not able to reproduce the test failure on my local machine on either JDK 7 or JDK 8.
Wei-Chiu Chuang: does it fail on your machine?

          jojochuang Wei-Chiu Chuang added a comment -

Thanks for looking into it. Maybe the NPE is unrelated.
I'm not able to make the test fail either; it could be an intermittently flaky test.
In any case, it would be great if you could improve the test diagnostics using GenericTestUtils#assertExceptionContains. This utility method prints the stack trace if the exception message doesn't match the expected value.
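The value of such a helper can be sketched as follows (a simplified re-implementation with hypothetical structure, not Hadoop's actual GenericTestUtils code): on a mismatch it fails with the full stack trace of the caught exception attached, so the real cause is visible in the test log instead of a bare assertion failure.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/**
 * Simplified sketch of an assertExceptionContains-style test helper:
 * pass silently when the message contains the expected fragment,
 * otherwise fail with the offending exception's stack trace attached.
 */
public class AssertExceptionContains {
  static void assertExceptionContains(String expected, Throwable t) {
    String msg = t.getMessage();
    if (msg == null || !msg.contains(expected)) {
      StringWriter sw = new StringWriter();
      t.printStackTrace(new PrintWriter(sw));
      throw new AssertionError("Expected to find '" + expected
          + "' but got unexpected exception: " + sw, t);
    }
  }

  public static void main(String[] args) {
    // Matching message: passes silently.
    assertExceptionContains("not writable",
        new java.io.IOException("Directory is not writable: volumeA/current"));
    try {
      // Non-matching message: fails, carrying the full stack trace.
      assertExceptionContains("not writable",
          new java.io.IOException("Connection reset"));
    } catch (AssertionError e) {
      System.out.println("assertion failed with trace attached: "
          + e.getMessage().contains("Connection reset"));
    }
  }
}
```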

          shahrs87 Rushabh S Shah added a comment -

          Sure. Will update the patch shortly.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Closing the JIRA as part of 2.7.3 release.


People

• Assignee: shahrs87 Rushabh S Shah
• Reporter: shahrs87 Rushabh S Shah
• Votes: 0
• Watchers: 17
