Hadoop HDFS / HDFS-10333

Intermittent org.apache.hadoop.hdfs.TestFileAppend failure in trunk

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: hdfs
    • Labels: None
    • Target Version/s:

      Description

      Java 8 (I used JAVA_HOME=/opt/toolchain/jdk1.8.0_25):

      -------------------------------------------------------
       T E S T S
      -------------------------------------------------------
      Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
      Running org.apache.hadoop.hdfs.TestFileAppend
      Tests run: 12, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 27.75 sec <<< FAILURE! - in org.apache.hadoop.hdfs.TestFileAppend
      testMultipleAppends(org.apache.hadoop.hdfs.TestFileAppend)  Time elapsed: 3.674 sec  <<< ERROR!
      java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:43067,DS-cf80da41-3697-4afa-8f89-93693cd5035d,DISK], DatanodeInfoWithStorage[127.0.0.1:32946,DS-3b08422c-959e-42f0-a624-91b2524c4371,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:43067,DS-cf80da41-3697-4afa-8f89-93693cd5035d,DISK], DatanodeInfoWithStorage[127.0.0.1:32946,DS-3b08422c-959e-42f0-a624-91b2524c4371,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
              at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1166)
              at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1232)
              at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1423)
              at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1338)
              at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1321)
              at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:599)
      
      
      

      However, when I run with Java 1.7, the test sometimes succeeds and sometimes fails with:

      Tests run: 12, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 41.32 sec <<< FAILURE! - in org.apache.hadoop.hdfs.TestFileAppend
      testMultipleAppends(org.apache.hadoop.hdfs.TestFileAppend)  Time elapsed: 9.099 sec  <<< ERROR!
      java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:49006,DS-498240fa-d1c7-4ba1-b97e-a1761cbbefa5,DISK], DatanodeInfoWithStorage[127.0.0.1:43097,DS-b83b49ce-fc14-4b9e-a3fc-7df2cd9fc753,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:49006,DS-498240fa-d1c7-4ba1-b97e-a1761cbbefa5,DISK], DatanodeInfoWithStorage[127.0.0.1:43097,DS-b83b49ce-fc14-4b9e-a3fc-7df2cd9fc753,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
      	at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1162)
      	at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1232)
      	at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1423)
      	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1338)
      	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1321)
      	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:599)
      
      

      The failure of this test is intermittent, but it occurs fairly often.
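      For reference, a hedged sketch of the client-side knob named in the exception message above (the property key is quoted from the message; the surrounding code is illustrative only, not part of this test):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hdfs.HdfsConfiguration;

      // The DEFAULT policy asked for a replacement datanode but none was
      // available. A client may relax this via the quoted property, e.g.
      // NEVER skips replacement entirely (plausible for tiny test clusters,
      // risky for production durability).
      Configuration conf = new HdfsConfiguration();
      conf.set("dfs.client.block.write.replace-datanode-on-failure.policy",
          "NEVER");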

        Activity

        Yiqun Lin added a comment -

        Thanks Andrew Wang for the commit!

        Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #9764 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9764/)
        HDFS-10333. Intermittent org.apache.hadoop.hdfs.TestFileAppend failure (wang: rev 45788204ae2ac82ccb3b4fe2fd22aead1dd79f0d)

        • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestFileAppend.java
        Andrew Wang added a comment -

        LGTM +1, committed back through branch-2.8. Thank you for the contribution, Yiqun Lin!

        Hadoop QA added a comment -
        -1 overall



        Vote  Subsystem    Runtime   Comment
          0   reexec       0m 11s    Docker mode activated.
         +1   @author      0m 0s     The patch does not contain any @author tags.
         +1   test4tests   0m 0s     The patch appears to include 1 new or modified test files.
         +1   mvninstall   6m 49s    trunk passed
         +1   compile      0m 38s    trunk passed with JDK v1.8.0_91
         +1   compile      0m 41s    trunk passed with JDK v1.7.0_95
         +1   checkstyle   0m 27s    trunk passed
         +1   mvnsite      0m 51s    trunk passed
         +1   mvneclipse   0m 13s    trunk passed
         +1   findbugs     1m 55s    trunk passed
         +1   javadoc      1m 5s     trunk passed with JDK v1.8.0_91
         +1   javadoc      1m 46s    trunk passed with JDK v1.7.0_95
         +1   mvninstall   0m 46s    the patch passed
         +1   compile      0m 35s    the patch passed with JDK v1.8.0_91
         +1   javac        0m 35s    the patch passed
         +1   compile      0m 38s    the patch passed with JDK v1.7.0_95
         +1   javac        0m 38s    the patch passed
         +1   checkstyle   0m 23s    the patch passed
         +1   mvnsite      0m 48s    the patch passed
         +1   mvneclipse   0m 11s    the patch passed
         +1   whitespace   0m 0s     Patch has no whitespace issues.
         +1   findbugs     2m 5s     the patch passed
         +1   javadoc      0m 59s    the patch passed with JDK v1.8.0_91
         +1   javadoc      1m 42s    the patch passed with JDK v1.7.0_95
         -1   unit         57m 3s    hadoop-hdfs in the patch failed with JDK v1.8.0_91.
         +1   unit         54m 45s   hadoop-hdfs in the patch passed with JDK v1.7.0_95.
         +1   asflicense   0m 24s    Patch does not generate ASF License warnings.
                           136m 57s  (total)



        Reason                             Tests
        JDK v1.8.0_91 Failed junit tests   hadoop.hdfs.shortcircuit.TestShortCircuitCache



        Subsystem                   Report/Notes
        Docker Image                yetus/hadoop:cf2ee45
        JIRA Patch URL              https://issues.apache.org/jira/secure/attachment/12802974/HDFS-10333.001.patch
        JIRA Issue                  HDFS-10333
        Optional Tests              asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname                       Linux beb0d668ac24 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool                  maven
        Personality                 /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision                trunk / 411fb4b
        Default Java                1.7.0_95
        Multi-JDK versions          /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs                    v3.0.0
        unit                        https://builds.apache.org/job/PreCommit-HDFS-Build/15396/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt
        unit test logs              https://builds.apache.org/job/PreCommit-HDFS-Build/15396/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_91.txt
        JDK v1.7.0_95 Test Results  https://builds.apache.org/job/PreCommit-HDFS-Build/15396/testReport/
        modules                     C: hadoop-hdfs-project/hadoop-hdfs  U: hadoop-hdfs-project/hadoop-hdfs
        Console output              https://builds.apache.org/job/PreCommit-HDFS-Build/15396/console
        Powered by                  Apache Yetus 0.2.0  http://yetus.apache.org

        This message was automatically generated.

        Yiqun Lin added a comment -

        Sorry, let me correct my earlier statement:

            The test will fail once a socket I/O timeout happens.

        This should instead read: the test has a greater chance of failing when socket I/O timeouts happen frequently over a period of time.

        Yiqun Lin added a comment -

        I ran this test and other similar tests in TestFileAppend locally. I found that "Connection refused" errors sometimes happen due to socket I/O timeouts. Here are the logs:

        2016-05-09 17:33:49,828 [DataXceiver for client DFSClient_NONMAPREDUCE_140749666_1 at /127.0.0.1:58040 [Receiving block BP-2032095287-127.0.0.1-1462786332089:blk_1073741827_1003]] ERROR datanode.DataNode (DataXceiver.java:run(316)) - 127.0.0.1:58021:DataXceiver error processing WRITE_BLOCK operation  src: /127.0.0.1:58040 dst: /127.0.0.1:58021
        java.net.ConnectException: Connection refused
        	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:746)
        	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:171)
        	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:105)
        	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:289)
        	at java.lang.Thread.run(Thread.java:745)
        

        This causes an IOException, and the first target node is set as a bad node (in fact, the datanode that failed during connection setup) in the response:

        // Excerpt from the datanode's write-block error handling (see
        // DataXceiver.writeBlock in the stack trace above): on a downstream
        // connect failure, reply ERROR to the client and name the first
        // target as the bad link.
        } catch (IOException e) {
          if (isClient) {
            BlockOpResponseProto.newBuilder()
              .setStatus(ERROR)
              // NB: Unconditionally using the xfer addr w/o hostname
              .setFirstBadLink(targets[0].getXferAddr())
              .build()
              .writeDelimitedTo(replyOut);
            replyOut.flush();
          }
        

        The index of this node is then recorded in errorState, and the node is marked as a bad node. The code therefore does not return false here:

        if (!errorState.hasDatanodeError() && !shouldHandleExternalError()) {
          return false;
        }
        

        Execution instead continues from processDatanodeOrExternalError into setupPipelineForAppendOrRecovery, which finally fails with the message "no more good datanodes being available to replace a bad datanode on the existing pipeline".

        So we should disable the property dfs.client.block.write.replace-datanode-on-failure.enable; there is then no need to set dfs.client.block.write.replace-datanode-on-failure.policy here. The test will fail once a socket I/O timeout happens. In addition, the similar test TestFileAppend#testMultiAppend2 already disables this; a minimal sketch of the change is below.
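        A minimal sketch of the idea, assuming a MiniDFSCluster-based JUnit test (the scaffolding is illustrative, not the committed patch):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hdfs.HdfsConfiguration;
        import org.apache.hadoop.hdfs.MiniDFSCluster;

        // Disable datanode replacement on pipeline failure so that a transient
        // "Connection refused" cannot escalate into "no more good datanodes
        // being available"; the key is the real HDFS client property name.
        Configuration conf = new HdfsConfiguration();
        conf.setBoolean(
            "dfs.client.block.write.replace-datanode-on-failure.enable", false);
        MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).build();
        try {
          // ... run the multiple-append workload against cluster.getFileSystem() ...
        } finally {
          cluster.shutdown();
        }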

        I will post a patch for this later. Thanks for reviewing.


          People

          • Assignee: Yiqun Lin
          • Reporter: Yongjun Zhang
          • Votes: 0
          • Watchers: 8

            Dates

            • Created:
            • Updated:
            • Resolved: