Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha2
    • Component/s: datanode, test
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      TestDataNodeVolumeFailure#testVolumeFailure fails a volume and verifies the blocks and files are replicated correctly.

      1. To fail a volume, it deletes all the blocks and sets the data dir read-only.
        testVolumeFailure() snippet
            // fail the volume
            // delete/make non-writable one of the directories (failed volume)
            data_fail = new File(dataDir, "data3");
            failedDir = MiniDFSCluster.getFinalizedDir(dataDir, 
                cluster.getNamesystem().getBlockPoolId());
            if (failedDir.exists() &&
                //!FileUtil.fullyDelete(failedDir)
                !deteteBlocks(failedDir)
                ) {
              throw new IOException("Could not delete hdfs directory '" + failedDir + "'");
            }
            data_fail.setReadOnly();
            failedDir.setReadOnly();
        

        However, there are two bugs here that prevent the blocks from being deleted.

        • The failedDir directory for finalized blocks is not calculated correctly: it should use data_fail instead of dataDir as the base directory.
        • When deleting block files in deteteBlocks(failedDir), it assumes that there are no subdirectories in the data dir. This assumption is also noted in the comments.

          // we use only small number of blocks to avoid creating subdirs in the data dir..

          This is not true. On my local cluster and in MiniDFSCluster, there are two levels of subdirectories (subdir0/subdir0/) regardless of the number of blocks.
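          A corrected deletion helper would need to recurse into those subdirectory levels. The sketch below is illustrative, not the actual patch: it walks a finalized directory, descends into any subdirN levels, and deletes every file whose name carries the blk_ prefix (block and .meta files both use it in the HDFS on-disk layout).

```java
import java.io.File;
import java.io.IOException;

public class BlockDeleter {
  /**
   * Recursively delete all block and metadata files under a finalized
   * directory, descending into nested subdirN levels such as
   * subdir0/subdir0/. Returns the number of files deleted.
   */
  public static int deleteBlockFiles(File dir) throws IOException {
    int deleted = 0;
    File[] entries = dir.listFiles();
    if (entries == null) {
      return 0;  // not a directory, or unreadable
    }
    for (File f : entries) {
      if (f.isDirectory()) {
        deleted += deleteBlockFiles(f);  // descend into subdirN
      } else if (f.getName().startsWith("blk_")) {
        if (!f.delete()) {
          throw new IOException("Could not delete block file " + f);
        }
        deleted++;
      }
    }
    return deleted;
  }
}
```

          The companion fix for the first bug is to compute the finalized directory from data_fail rather than dataDir before handing it to a helper like this.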

      2. Meanwhile, to fail a volume, the test also needs to trigger the DataNode to remove the volume and to send a block report to the NameNode. This is basically done in the triggerFailure() method.
          private void triggerFailure(String path, long size) throws IOException {
            NamenodeProtocols nn = cluster.getNameNodeRpc();
            List<LocatedBlock> locatedBlocks =
              nn.getBlockLocations(path, 0, size).getLocatedBlocks();
            
            for (LocatedBlock lb : locatedBlocks) {
              DatanodeInfo dinfo = lb.getLocations()[1];
              ExtendedBlock b = lb.getBlock();
              try {
                accessBlock(dinfo, lb);
              } catch (IOException e) {
                System.out.println("Failure triggered, on block: " + b.getBlockId() +  
                    "; corresponding volume should be removed by now");
                break;
              }
            }
          }
        

        Accessing those blocks will not trigger failures if the directory is merely read-only while the block files are all still there. I ran the test multiple times without ever triggering a failure this way. Either we have to write new block files to the data directories, or we should have deleted the blocks correctly in the first place. I also think we need assertion code after triggering the volume failure: the assertions should check the DataNode volume failure summary explicitly, to make sure a volume failure was actually triggered (and noticed).
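        The assertion pattern suggested above amounts to polling until the DataNode reports a failed volume, rather than assuming the failure happened. Hadoop's test code has a similar helper (GenericTestUtils.waitFor); this standalone version just illustrates the shape of it:

```java
import java.util.function.BooleanSupplier;

public class WaitUtil {
  /**
   * Poll a condition until it holds or the timeout expires, failing the
   * test with an AssertionError on timeout.
   */
  public static void waitFor(BooleanSupplier condition, long intervalMs,
      long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!condition.getAsBoolean()) {
      if (System.currentTimeMillis() > deadline) {
        throw new AssertionError("Timed out waiting for condition");
      }
      Thread.sleep(intervalMs);
    }
  }
}
```

        In the test this could be used as, say, waitFor(() -> dn.getFSDataset().getNumFailedVolumes() > 0, 100, 30000) — the exact accessor for the failed-volume count is an assumption here, not a quote from the patch.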

      3. To make sure the NameNode is aware of the volume failure, the code explicitly sends block reports to the NameNode.
        TestDataNodeVolumeFailure#testVolumeFailure()
            cluster.getNameNodeRpc().blockReport(dnR, bpid, reports,
                new BlockReportContext(1, 0, System.nanoTime(), 0, false));
        

        The code that generates the block report is complex; it duplicates internal logic of BPServiceActor, so we may have to update it whenever that logic changes. In fact, volume failures are now reported by the DataNode via heartbeats. We should trigger a heartbeat request here, and make sure the NameNode has handled the heartbeat before we verify the block states.
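        The contract being proposed — trigger a heartbeat, then wait until the NameNode has processed it before verifying — can be modeled in miniature. The two interfaces below are stand-ins for the real Hadoop types (in the actual test one would use the DataNode/NameNode test utilities), so treat this as a sketch of the synchronization, not of the API:

```java
/** Stand-in for the DataNode side: ask it to heartbeat now. */
interface HeartbeatSource {
  void triggerHeartbeat();
}

/** Stand-in for the NameNode side: has the heartbeat been handled? */
interface HeartbeatSink {
  boolean hasProcessedHeartbeat();
}

public class HeartbeatSync {
  /**
   * Trigger a heartbeat and block until the receiving side has processed
   * it, so later verification does not race with heartbeat handling.
   */
  public static void heartbeatAndWait(HeartbeatSource dn, HeartbeatSink nn,
      long timeoutMs) throws InterruptedException {
    dn.triggerHeartbeat();
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!nn.hasProcessedHeartbeat()) {
      if (System.currentTimeMillis() > deadline) {
        throw new AssertionError("Heartbeat was not processed in time");
      }
      Thread.sleep(10);
    }
  }
}
```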

      4. When verifying via verify(), the test counts the real block files and asserts that real block files plus under-replicated blocks cover all blocks. Before counting under-replicated blocks, it triggers the BlockManager to compute the DataNode work:
            // force update of all the metric counts by calling computeDatanodeWork
            BlockManagerTestUtil.getComputedDatanodeWork(fsn.getBlockManager());
        

        However, counting physical block files and counting under-replicated blocks is not atomic. The NameNode informs the DataNode of the computed work at the next heartbeat, so this part of the code may fail when some blocks get replicated and the count of physical block files becomes stale. To avoid this, we should keep the DataNode from sending heartbeats after that point; a simple solution is to set dfs.heartbeat.interval long enough.
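        The mitigation is a one-line configuration change; the sketch below assumes conf is the Configuration handed to the MiniDFSCluster builder (dfs.heartbeat.interval is the real key, exposed as DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY):

```java
// Keep the DataNode from heartbeating on its own during verification, so
// the NameNode cannot dispatch the computed replication work and make the
// physical block-file count stale. The interval is in seconds.
conf.setLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, 3600);
```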

      This unit test has been there for years and seldom fails, simply because it has never triggered a real volume failure.

      1. HDFS-11030.000.patch
        7 kB
        Mingliang Liu
      2. HDFS-11030-branch-2.000.patch
        7 kB
        Mingliang Liu

        Activity

        liuml07 Mingliang Liu added a comment -

        Could anyone review the JIRA and the patch? Thanks.

        Ping Jitendra Nath Pandey and Arpit Agarwal.

        hadoopqa Hadoop QA added a comment -
        +1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 51s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        +1 mvninstall 6m 39s branch-2 passed
        +1 compile 0m 38s branch-2 passed with JDK v1.8.0_101
        +1 compile 0m 44s branch-2 passed with JDK v1.7.0_111
        +1 checkstyle 0m 28s branch-2 passed
        +1 mvnsite 0m 52s branch-2 passed
        +1 mvneclipse 0m 16s branch-2 passed
        +1 findbugs 1m 57s branch-2 passed
        +1 javadoc 0m 54s branch-2 passed with JDK v1.8.0_101
        +1 javadoc 1m 34s branch-2 passed with JDK v1.7.0_111
        +1 mvninstall 0m 43s the patch passed
        +1 compile 0m 36s the patch passed with JDK v1.8.0_101
        +1 javac 0m 36s the patch passed
        +1 compile 0m 40s the patch passed with JDK v1.7.0_111
        +1 javac 0m 40s the patch passed
        +1 checkstyle 0m 24s hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 50 unchanged - 7 fixed = 50 total (was 57)
        +1 mvnsite 0m 49s the patch passed
        +1 mvneclipse 0m 13s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 2m 9s the patch passed
        +1 javadoc 0m 50s the patch passed with JDK v1.8.0_101
        +1 javadoc 1m 32s the patch passed with JDK v1.7.0_111
        +1 unit 48m 20s hadoop-hdfs in the patch passed with JDK v1.7.0_111.
        +1 asflicense 0m 22s The patch does not generate ASF License warnings.
        134m 54s



        Reason Tests
        JDK v1.8.0_101 Failed junit tests hadoop.hdfs.TestEncryptionZones



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:b59b8b7
        JIRA Issue HDFS-11030
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12834547/HDFS-11030-branch-2.000.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 2ebe64861e73 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision branch-2 / 1f384b6
        Default Java 1.7.0_111
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111
        findbugs v3.0.0
        JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/17238/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/17238/console
        Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        jnp Jitendra Nath Pandey added a comment -

        +1. Thanks for the patch Mingliang Liu

        liuml07 Mingliang Liu added a comment -

        Committed to trunk and branch-2.8. Thanks Jitendra Nath Pandey for reviewing this.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10738 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10738/)
        HDFS-11030. TestDataNodeVolumeFailure#testVolumeFailure is flaky (though (liuml07: rev 0c49f73a6c19ce0d0cd59cf8dfaa9a35f67f47ab)

        • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDataNodeVolumeFailure.java

          People

          • Assignee:
            liuml07 Mingliang Liu
          • Reporter:
            liuml07 Mingliang Liu
          • Votes:
            0
          • Watchers:
            5
