Hadoop HDFS / HDFS-10283

o.a.h.hdfs.server.namenode.TestFSImageWithSnapshot#testSaveLoadImageWithAppending fails intermittently

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.8.0
    • Fix Version/s: 2.9.0, 3.0.0-alpha1
    • Component/s: test
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The test fails intermittently with the following exception:

      java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:47227,DS-dd109c14-79e5-4380-ac5e-4434cd7e25b5,DISK], DatanodeInfoWithStorage[127.0.0.1:56949,DS-6c0be75e-a78c-41b9-bfd0-7ee0cdefaa0e,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:47227,DS-dd109c14-79e5-4380-ac5e-4434cd7e25b5,DISK], DatanodeInfoWithStorage[127.0.0.1:56949,DS-6c0be75e-a78c-41b9-bfd0-7ee0cdefaa0e,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
      	at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1162)
      	at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1232)
      	at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1423)
      	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1338)
      	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1321)
      	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:599)
      

        Activity

        Mingliang Liu added a comment -

        This happens in our internal daily UT Jenkins runs and in recent Apache trunk pre-commit builds (e.g. https://builds.apache.org/job/PreCommit-HDFS-Build/15140/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_77.txt).

        Before this exception, the NN complained about failing to place enough replicas, as follows:

        2016-04-12 13:21:30,511 WARN  blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(380)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
        2016-04-12 13:21:30,511 WARN  blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(380)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
        2016-04-12 13:21:30,511 WARN  protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(162)) - Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK, ARCHIVE], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
        2016-04-12 13:21:30,512 WARN  blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(380)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) All required storage types are unavailable:  unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
        

        The basic problem here is that the number of datanodes equals the replication factor (which is 3 in the test). If a datanode in the write (append) pipeline fails, there is no spare datanode to replace it with. The block manager logs the warnings above but does not fail the request, which is why we see no exception thrown by the NN. The client side then finds that no new DN has been allocated to replace the bad node in the pipeline, and the DataStreamer throws java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.
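
        To make the failure concrete, below is a minimal, hypothetical repro sketch against the MiniDFSCluster test API (illustration only, not the attached patch; the class name, file path, and sizes are made up):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.hdfs.DFSTestUtil;
        import org.apache.hadoop.hdfs.DistributedFileSystem;
        import org.apache.hadoop.hdfs.HdfsConfiguration;
        import org.apache.hadoop.hdfs.MiniDFSCluster;

        public class ReplaceDatanodeRepro {
          public static void main(String[] args) throws Exception {
            Configuration conf = new HdfsConfiguration();
            // Exactly as many datanodes as the replication factor, as in the test.
            MiniDFSCluster cluster =
                new MiniDFSCluster.Builder(conf).numDataNodes(3).build();
            try {
              DistributedFileSystem fs = cluster.getFileSystem();
              Path file = new Path("/test/file");
              DFSTestUtil.createFile(fs, file, 1024L, (short) 3, 0L); // replication = 3

              cluster.stopDataNode(0); // fail one member of the pipeline

              // The append pipeline detects the dead node and, under the DEFAULT
              // policy, asks the NN for a replacement; with no spare datanode the
              // DataStreamer throws "Failed to replace a bad datanode ...".
              try (FSDataOutputStream out = fs.append(file)) {
                out.write(new byte[1024]);
                out.hflush();
              }
            } finally {
              cluster.shutdown();
            }
          }
        }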

        Mingliang Liu added a comment -

        A simple fix is to disable the replace-datanode-on-failure feature in the client configuration by setting dfs.client.block.write.replace-datanode-on-failure.enable to false (note that the related dfs.client.block.write.replace-datanode-on-failure.policy key selects among NEVER/DEFAULT/ALWAYS rather than taking a boolean).

        Actually, the test is not about the write (append) pipeline at all. When creating files, we can simply reduce the replication factor from 3 to 1 while keeping 3 datanodes available for placement, as sketched below. Another benefit is that the test runs faster, since the pipeline is much shorter.
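
        As a rough sketch of the two options (the committed change is the attached HDFS-10283.000.patch; the conf/fs/file variables below continue the hypothetical repro sketch above):

        // Option 1: turn off datanode replacement on the client entirely.
        conf.setBoolean(
            "dfs.client.block.write.replace-datanode-on-failure.enable", false);

        // Option 2 (taken here): keep 3 datanodes but create the test files with
        // replication 1, so a failed pipeline node never needs a replacement and
        // the one-node pipeline is also faster.
        DFSTestUtil.createFile(fs, file, 1024L, (short) 1, 0L);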

        Jing Zhao added a comment -

        +1 pending Jenkins.

        Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 17m 57s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        +1 mvninstall 6m 43s trunk passed
        +1 compile 0m 40s trunk passed with JDK v1.8.0_77
        +1 compile 0m 41s trunk passed with JDK v1.7.0_95
        +1 checkstyle 0m 21s trunk passed
        +1 mvnsite 0m 52s trunk passed
        +1 mvneclipse 0m 14s trunk passed
        +1 findbugs 1m 58s trunk passed
        +1 javadoc 1m 14s trunk passed with JDK v1.8.0_77
        +1 javadoc 1m 58s trunk passed with JDK v1.7.0_95
        +1 mvninstall 1m 0s the patch passed
        +1 compile 0m 50s the patch passed with JDK v1.8.0_77
        +1 javac 0m 50s the patch passed
        +1 compile 0m 45s the patch passed with JDK v1.7.0_95
        +1 javac 0m 45s the patch passed
        +1 checkstyle 0m 19s the patch passed
        +1 mvnsite 0m 54s the patch passed
        +1 mvneclipse 0m 12s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 findbugs 2m 6s the patch passed
        +1 javadoc 1m 5s the patch passed with JDK v1.8.0_77
        +1 javadoc 1m 40s the patch passed with JDK v1.7.0_95
        -1 unit 78m 50s hadoop-hdfs in the patch failed with JDK v1.8.0_77.
        -1 unit 77m 31s hadoop-hdfs in the patch failed with JDK v1.7.0_95.
        +1 asflicense 0m 23s Patch does not generate ASF License warnings.
        200m 30s



        Reason Tests
        JDK v1.8.0_77 Failed junit tests hadoop.hdfs.TestHFlush
          hadoop.hdfs.TestReadStripedFileWithMissingBlocks
          hadoop.hdfs.server.namenode.TestFSEditLogLoader
          hadoop.hdfs.tools.TestDFSAdmin
          hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot
          hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead
          hadoop.hdfs.TestDFSStripedOutputStreamWithFailure
          hadoop.hdfs.TestDistributedFileSystem
        JDK v1.8.0_77 Timed out junit tests org.apache.hadoop.hdfs.TestWriteReadStripedFile
          org.apache.hadoop.hdfs.TestReadStripedFileWithDecoding
        JDK v1.7.0_95 Failed junit tests hadoop.hdfs.TestHFlush
          hadoop.hdfs.TestReadStripedFileWithMissingBlocks
          hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped
          hadoop.hdfs.tools.TestDFSAdmin
          hadoop.hdfs.shortcircuit.TestShortCircuitLocalRead
          hadoop.hdfs.TestDFSStripedOutputStreamWithFailure
          hadoop.hdfs.TestRollingUpgrade
        JDK v1.7.0_95 Timed out junit tests org.apache.hadoop.hdfs.TestWriteReadStripedFile
          org.apache.hadoop.hdfs.TestReadStripedFileWithDecoding



        Subsystem Report/Notes
        Docker Image yetus/hadoop:fbe3e86
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12798593/HDFS-10283.000.patch
        JIRA Issue HDFS-10283
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux bb3db2c7fb49 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 192112d
        Default Java 1.7.0_95
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_77 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15154/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_77.txt
        unit https://builds.apache.org/job/PreCommit-HDFS-Build/15154/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        unit test logs https://builds.apache.org/job/PreCommit-HDFS-Build/15154/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_77.txt https://builds.apache.org/job/PreCommit-HDFS-Build/15154/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95.txt
        JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15154/testReport/
        modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
        Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15154/console
        Powered by Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        Mingliang Liu added a comment -

        The failing tests are not related, since the change only touches the test case in question (o.a.h.hdfs.server.namenode.TestFSImageWithSnapshot#testSaveLoadImageWithAppending), which passed in this pre-commit run. I also ran it locally ~10 times and it passed every time.

        Jing Zhao added a comment -

        I've committed this to trunk and branch-2. Thanks for the contribution, Mingliang Liu!

        Mingliang Liu added a comment -

        Thank you Jing Zhao for your review and commit.

        Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #9619 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9619/)
        HDFS-10283. (jing9: rev 89a838769ff5b6c64565e6949b14d7fed05daf54)

        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestFSImageWithSnapshot.java

          People

          • Assignee: Mingliang Liu
          • Reporter: Mingliang Liu
          • Votes: 0
          • Watchers: 4
