Hadoop HDFS / HDFS-4699

TestPipelinesFailover#testPipelineRecoveryStress fails sporadically

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0, 3.0.0, 0.23.7, 2.0.4-alpha
    • Fix Version/s: 2.1.0-beta, 0.23.8, 1.2.1
    • Component/s: test
    • Labels: None

      Description

      I have seen TestPipelinesFailover#testPipelineRecoveryStress fail sporadically due to a timeout in loopRecoverLease, which waits up to 30 seconds before timing out.
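
The wait described above is a poll-with-deadline loop. The following is a minimal, self-contained sketch of that pattern, not the test's actual loopRecoverLease code. The class name PollUntilSketch, the method pollUntil, and the stand-in condition are hypothetical; a real test would poll the file system for lease recovery instead.

```java
import java.util.function.BooleanSupplier;

class PollUntilSketch {
    /**
     * Polls the condition until it returns true or the timeout elapses.
     * Returns true if the condition was met, false on timeout or interrupt.
     */
    static boolean pollUntil(BooleanSupplier check, long intervalMillis, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (check.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // deadline passed without the condition holding
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for "has lease recovery completed?":
        // it reports success on the third poll.
        int[] polls = {0};
        boolean recovered = pollUntil(() -> ++polls[0] >= 3, 10, 1_000);
        System.out.println("recovered=" + recovered + " after " + polls[0] + " polls");
    }
}
```

With a loop of this shape, the timeout is a single parameter, so a test that needs more headroom on slow machines can raise it without touching the polling logic.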

      Attachments

      1. HDFS-4699.branch-1.1.patch (1.0 kB, Chris Nauroth)
      2. HDFS-4699.branch-0.23.1.patch (1 kB, Chris Nauroth)
      3. HDFS-4699.1.patch (3 kB, Chris Nauroth)

          Activity

          Chris Nauroth added a comment -

          This patch addresses multiple problems that were contributing to the intermittent failures:

          1. When BlockReceiver gets an IOException, it tries to assess whether the error was related to disk or network, and if disk-related, calls DiskChecker. This test triggers rapid NN failovers, so it's common to see a mix of different kinds of network errors. The logic for detecting a network error was incomplete: it miscategorized some network failures as disk-related, which triggered a huge flurry of DiskChecker activity. Particularly on Windows, rapid calls to DiskChecker can be sluggish, because it needs to fork a new process. I've added logic to filter out TCP RST and anything related to a java.nio.channels.SocketChannel.
          2. The test triggers rapid NN failovers. The client retry handling uses an exponential backoff with a maximum delay of 15s between failover attempts. Particularly on small VMs, I saw multiple failover attempts quickly rising to a 15s delay and sometimes causing the whole test to time out. I've set configuration to cap the failover delay at 1s.
          3. There is a polling loop that tries to wait up to 30s for lease recovery. Even with the prior changes, I've observed that 30s isn't sufficient on a small VM. After I increased this to 60s, I saw consistent successful test runs.
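
To illustrate why lowering the backoff cap in item 2 matters, here is a minimal sketch of capped exponential backoff. It is not Hadoop's retry code: the class BackoffCapSketch, the 500 ms base delay, and the attempt count are assumptions chosen for illustration, and the random jitter that real retry policies often add is left out.

```java
class BackoffCapSketch {
    /** Delay before retry number `attempt` (0-based): base * 2^attempt, capped. No jitter. */
    static long delayMillis(long baseMillis, int attempt, long capMillis) {
        int exponent = Math.min(attempt, 30); // keep the shift well inside long range
        return Math.min(baseMillis << exponent, capMillis);
    }

    /** Sum of the delays for the first `attempts` retries. */
    static long totalDelayMillis(long baseMillis, int attempts, long capMillis) {
        long total = 0;
        for (int i = 0; i < attempts; i++) {
            total += delayMillis(baseMillis, i, capMillis);
        }
        return total;
    }

    public static void main(String[] args) {
        long base = 500;   // assumed base delay in ms, not taken from the issue
        int attempts = 10; // assumed number of consecutive failover attempts
        System.out.println("cap 15s: " + totalDelayMillis(base, attempts, 15_000) + " ms");
        System.out.println("cap 1s:  " + totalDelayMillis(base, attempts, 1_000) + " ms");
    }
}
```

Under these assumed numbers, ten consecutive failover attempts spend 90.5 s waiting with a 15s cap but only 9.5 s with a 1s cap, which is the difference between blowing through a test timeout and staying comfortably inside it.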

          I've verified the test on both Mac and Windows.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579025/HDFS-4699.1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4266//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4266//console

          This message is automatically generated.

          Chris Nauroth added a comment -

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
          org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks

          The test failure is unrelated to this patch. The test is known to be flaky. See HDFS-3538.

          Chris Nauroth added a comment -

          I noticed that HDFS-4581 also put the network error filtering logic into branch-1 and branch-0.23, so I've added patches for those branches too. I uploaded the trunk patch again, just so that Jenkins sees it as the newest file.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579356/HDFS-4699.1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4274//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4274//console

          This message is automatically generated.

          Kihwal Lee added a comment -

          +1 the patch looks good. It's sad that we have to parse string, but even Sun's NIO example code did it that way.
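
For readers unfamiliar with the string parsing mentioned here, the following is a rough sketch of classifying an exception as network-related by type first and by message text second. It is not the committed DataNode code: the class NetworkErrorSketch, the method looksNetworkRelated, and the specific message prefixes are assumptions for illustration. The messages a JVM or OS produces vary by platform, which is what makes this approach brittle.

```java
import java.io.IOException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.nio.channels.ClosedChannelException;

class NetworkErrorSketch {
    /** Returns true if the exception looks like a network failure rather than a disk failure. */
    static boolean looksNetworkRelated(Exception e) {
        // Prefer type checks where the exception class already identifies the source.
        if (e instanceof SocketException
                || e instanceof SocketTimeoutException
                || e instanceof ClosedChannelException) {
            return true;
        }
        // Fall back to message inspection for plain IOExceptions.
        // These patterns are illustrative assumptions, not an exhaustive list.
        String msg = e.getMessage();
        return msg != null
                && (msg.startsWith("Connection reset")
                    || msg.startsWith("Broken pipe")
                    || msg.contains("java.nio.channels.SocketChannel"));
    }

    public static void main(String[] args) {
        Exception[] samples = {
            new SocketException("Connection reset"),
            new IOException("Connection reset by peer"),
            new IOException("No space left on device"),
        };
        for (Exception e : samples) {
            System.out.println(e + " -> network-related: " + looksNetworkRelated(e));
        }
    }
}
```

A caller would skip the expensive disk check whenever this returns true and run it only for errors that still look disk-related.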

          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3635 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3635/)
          HDFS-4699. TestPipelinesFailover#testPipelineRecoveryStress fails sporadically. Contributed by Chris Nauroth. (Revision 1469839)

          Result = SUCCESS
          kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1469839
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java
          Kihwal Lee added a comment -

          I've committed this to trunk, branch-2, branch-0.23 and branch-1. Thanks for working on this, Chris.

          Chris Nauroth added a comment -

          Thanks for the commit, Kihwal!

          "It's sad that we have to parse string, but even Sun's NIO example code did it that way."

          Yes, it's definitely brittle. Longer-term, perhaps there is a way to refactor such that at any point in our code, we know whether the I/O error came from disk or network? I tried investigating this briefly, but it certainly would be a much bigger and riskier change.

          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #189 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/189/)
          HDFS-4699. TestPipelinesFailover#testPipelineRecoveryStress fails sporadically. Contributed by Chris Nauroth. (Revision 1469839)

          Result = SUCCESS
          kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1469839
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #587 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/587/)
          HDFS-4699. TestPipelinesFailover#testPipelineRecoveryStress fails sporadically. Contributed by Chris Nauroth. (Revision 1469842)

          Result = UNSTABLE
          kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1469842
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1378 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1378/)
          HDFS-4699. TestPipelinesFailover#testPipelineRecoveryStress fails sporadically. Contributed by Chris Nauroth. (Revision 1469839)

          Result = FAILURE
          kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1469839
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java
          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1405 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1405/)
          HDFS-4699. TestPipelinesFailover#testPipelineRecoveryStress fails sporadically. Contributed by Chris Nauroth. (Revision 1469839)

          Result = SUCCESS
          kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1469839
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java
          Suresh Srinivas added a comment -

          I merged this patch to branch-1.2 to be picked up for 1.2.1.

          Matt Foley added a comment -

          Changed the fixVersion from 1.3.0 to 1.2.1 per Suresh's merge.
          Also made TargetVersion list consistent with fixVersion list.


            People

            • Assignee: Chris Nauroth
            • Reporter: Chris Nauroth
            • Votes: 0
            • Watchers: 6
