Hadoop HDFS / HDFS-4699

TestPipelinesFailover#testPipelineRecoveryStress fails sporadically

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0, 3.0.0, 0.23.7, 2.0.4-alpha
    • Fix Version/s: 2.1.0-beta, 0.23.8, 1.2.1
    • Component/s: test
    • Labels:
      None

      Description

      I have seen TestPipelinesFailover#testPipelineRecoveryStress fail sporadically due to timeout during loopRecoverLease, which waits for up to 30 seconds before timing out.
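
A wait of this kind is typically a poll-until-deadline loop. The following is a minimal sketch of that pattern, not the test's actual loopRecoverLease code; the class name, method name, and parameters are hypothetical.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper illustrating a poll-until-deadline loop. */
final class PollUntil {
    private PollUntil() {}

    /**
     * Calls check repeatedly until it returns true or timeoutMillis elapses.
     * Returns true on success, false on timeout or if the thread is interrupted.
     */
    static boolean waitFor(BooleanSupplier check, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (check.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            try {
                // Sleep for the polling interval, but never past the deadline.
                Thread.sleep(Math.min(intervalMillis, Math.max(1L, remainingNanos / 1_000_000L)));
            } catch (InterruptedException e) {
                // Restore the interrupt flag and report failure to the caller.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With a helper like this, the timeout is a single argument, so lengthening the wait on slow machines is a one-line change.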

      1. HDFS-4699.branch-1.1.patch
        1.0 kB
        Chris Nauroth
      2. HDFS-4699.branch-0.23.1.patch
        1 kB
        Chris Nauroth
      3. HDFS-4699.1.patch
        3 kB
        Chris Nauroth

        Issue Links

          Activity

          Transition                    Time In Source Status   Execution Times   Last Executer   Last Execution Date
          Patch Available -> Open       1d 18h 13m              1                 Chris Nauroth   18/Apr/13 17:03
          Open -> Patch Available       1d 42m                  2                 Chris Nauroth   18/Apr/13 17:16
          Patch Available -> Resolved   22h 13m                 1                 Kihwal Lee      19/Apr/13 15:29
          Resolved -> Closed            106d 17h 54m            1                 Matt Foley      04/Aug/13 09:23
          Allen Wittenauer made changes -
          Fix Version/s 3.0.0 [ 12320356 ]
          Matt Foley made changes -
          Status Resolved [ 5 ] -> Closed [ 6 ]
          Matt Foley added a comment -

          Changed the fixVersion from 1.3.0 to 1.2.1 per Suresh's merge.
          Also made TargetVersion list consistent with fixVersion list.

          Matt Foley made changes -
          Target Version/s 1.2.1 [ 12324148 ] -> 0.23.8, 3.0.0, 2.1.0-beta, 1.2.1 [ 12324141, 12320356, 12324031, 12324148 ]
          Matt Foley made changes -
          Fix Version/s 1.2.1 [ 12324148 ]
          Fix Version/s 1.3.0 [ 12324328 ]
          Suresh Srinivas made changes -
          Affects Version/s 1.2.0 [ 12321657 ]
          Affects Version/s 1.3.0 [ 12324328 ]
          Target Version/s 3.0.0, 0.23.7, 2.0.4-alpha, 1.3.0 [ 12320356, 12323955, 12324136, 12324328 ] -> 1.2.1 [ 12324148 ]
          Suresh Srinivas added a comment -

          I merged this patch to branch-1.2 to be picked up for 1.2.1.

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1405 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1405/)
          HDFS-4699. TestPipelinesFailover#testPipelineRecoveryStress fails sporadically. Contributed by Chris Nauroth. (Revision 1469839)

          Result = SUCCESS
          kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1469839
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1378 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1378/)
          HDFS-4699. TestPipelinesFailover#testPipelineRecoveryStress fails sporadically. Contributed by Chris Nauroth. (Revision 1469839)

          Result = FAILURE
          kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1469839
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #587 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/587/)
          HDFS-4699. TestPipelinesFailover#testPipelineRecoveryStress fails sporadically. Contributed by Chris Nauroth. (Revision 1469842)

          Result = UNSTABLE
          kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1469842
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #189 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/189/)
          HDFS-4699. TestPipelinesFailover#testPipelineRecoveryStress fails sporadically. Contributed by Chris Nauroth. (Revision 1469839)

          Result = SUCCESS
          kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1469839
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java
          Chris Nauroth added a comment -

          Thanks for the commit, Kihwal!

          "It's sad that we have to parse a string, but even Sun's NIO example code did it that way."

          Yes, it's definitely brittle. Longer-term, perhaps there is a way to refactor such that at any point in our code, we know whether the I/O error came from disk or network? I tried investigating this briefly, but it certainly would be a much bigger and riskier change.
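
One way such a refactoring could look, purely as an illustrative sketch: wrap each I/O call at the point where its origin is known and tag the exception with its source, so later code does not need to inspect message strings. Every name below (IoErrorSource, Source, SourcedIOException, tagged) is hypothetical and does not exist in Hadoop.

```java
import java.io.IOException;

/** Hypothetical sketch: tag I/O failures with their origin at the call site. */
final class IoErrorSource {
    private IoErrorSource() {}

    enum Source { DISK, NETWORK }

    /** IOException that remembers whether it came from disk or network I/O. */
    static final class SourcedIOException extends IOException {
        private final Source source;

        SourcedIOException(Source source, IOException cause) {
            super(source + " I/O error: " + cause.getMessage(), cause);
            this.source = source;
        }

        Source source() {
            return source;
        }
    }

    /** An I/O operation that may fail with an IOException. */
    interface IoAction<T> {
        T run() throws IOException;
    }

    /** Runs the action and tags any untagged IOException with the given source. */
    static <T> T tagged(Source source, IoAction<T> action) throws IOException {
        try {
            return action.run();
        } catch (SourcedIOException e) {
            throw e; // already tagged closer to the origin; keep that tag
        } catch (IOException e) {
            throw new SourcedIOException(source, e);
        }
    }
}
```

A caller would then branch on source() instead of parsing messages, at the cost of wrapping every disk and socket operation, which is consistent with this being a much larger change.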

          Kihwal Lee made changes -
          Status Patch Available [ 10002 ] -> Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Target Version/s 3.0.0, 0.23.7, 2.0.4-alpha, 1.3.0 [ 12320356, 12323955, 12324136, 12324328 ] -> 0.23.7, 3.0.0, 2.0.4-alpha, 1.3.0 [ 12323955, 12320356, 12324136, 12324328 ]
          Fix Version/s 3.0.0 [ 12320356 ]
          Fix Version/s 2.0.5-beta [ 12324031 ]
          Fix Version/s 0.23.8 [ 12324141 ]
          Fix Version/s 1.3.0 [ 12324328 ]
          Resolution Fixed [ 1 ]
          Kihwal Lee added a comment -

          I've committed this to trunk, branch-2, branch-0.23 and branch-1. Thanks for working on this, Chris.

          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3635 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3635/)
          HDFS-4699. TestPipelinesFailover#testPipelineRecoveryStress fails sporadically. Contributed by Chris Nauroth. (Revision 1469839)

          Result = SUCCESS
          kihwal : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1469839
          Files :

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestPipelinesFailover.java
          Kihwal Lee added a comment -

          +1, the patch looks good. It's sad that we have to parse a string, but even Sun's NIO example code did it that way.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579356/HDFS-4699.1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4274//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4274//console

          This message is automatically generated.

          Chris Nauroth added a comment -

          I noticed that HDFS-4581 also put the network error filtering logic into branch-1 and branch-0.23, so I've added patches for those branches too. I uploaded the trunk patch again, just so that Jenkins sees it as the newest file.

          Chris Nauroth made changes -
          Status Open [ 1 ] -> Patch Available [ 10002 ]
          Chris Nauroth made changes -
          Attachment HDFS-4699.1.patch [ 12579356 ]
          Chris Nauroth made changes -
          Attachment HDFS-4699.1.patch [ 12579025 ]
          Chris Nauroth made changes -
          Attachment HDFS-4699.branch-0.23.1.patch [ 12579352 ]
          Attachment HDFS-4699.branch-1.1.patch [ 12579353 ]
          Chris Nauroth made changes -
          Affects Version/s 0.23.7 [ 12323955 ]
          Affects Version/s 2.0.4-alpha [ 12324136 ]
          Affects Version/s 1.3.0 [ 12324328 ]
          Target Version/s 3.0.0 [ 12320356 ] -> 3.0.0, 0.23.7, 2.0.4-alpha, 1.3.0 [ 12320356, 12323955, 12324136, 12324328 ]
          Chris Nauroth made changes -
          Status Patch Available [ 10002 ] -> Open [ 1 ]
          Chris Nauroth made changes -
          Link This issue relates to HDFS-4581 [ HDFS-4581 ]
          Chris Nauroth added a comment -

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
          org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks

          The test failure is unrelated to this patch. The test is known to be flaky. See HDFS-3538.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12579025/HDFS-4699.1.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRacks

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4266//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4266//console

          This message is automatically generated.

          Chris Nauroth made changes -
          Status Open [ 1 ] -> Patch Available [ 10002 ]
          Chris Nauroth made changes -
          Attachment HDFS-4699.1.patch [ 12579025 ]
          Chris Nauroth added a comment -

          This patch addresses multiple problems that were contributing to the intermittent failures:

          1. When BlockReceiver gets an IOException, it tries to assess whether the error was related to disk or network and, if disk-related, calls DiskChecker. This test triggers rapid NN failovers, so it's common to see a mix of different kinds of network errors. The logic for detecting a network error was incomplete: it miscategorized some network failures as disk-related, which triggered a huge flurry of DiskChecker activity. Particularly on Windows, rapid calls to DiskChecker can be sluggish, because it needs to fork a new process. I've added logic to filter out TCP RST and anything related to a java.nio.channels.SocketChannel.
          2. The test triggers rapid NN failovers. The client retry handling uses an exponential backoff with a maximum delay of 15s between failover attempts. Particularly on small VMs, I saw multiple failover attempts quickly rising to a 15s delay and sometimes causing the whole test to time out. I've made a change that sets configuration to cap the failover delay at 1s.
          3. There is a polling loop that waits up to 30s for lease recovery. Even with the prior changes, I've observed that 30s isn't sufficient on a small VM. After I increased this to 60s, I saw consistently successful test runs.
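
The filtering described in item 1 can be sketched roughly as follows. This is an illustration under assumptions, not the code from the patch: the class and method names are hypothetical, and the message prefixes ("Connection reset", "Broken pipe") are typical JVM/OS wordings chosen as examples rather than the exact strings the patch checks.

```java
import java.io.IOException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.nio.channels.ClosedChannelException;

/** Hypothetical sketch of classifying an exception as network-related. */
final class NetworkErrorFilter {
    private NetworkErrorFilter() {}

    static boolean isNetworkRelated(Exception e) {
        // Prefer the exception type when it already identifies a network failure.
        if (e instanceof SocketException
                || e instanceof SocketTimeoutException
                || e instanceof ClosedChannelException) {
            return true;
        }
        // Fall back to message matching for plain IOExceptions (brittle by nature).
        String msg = e.getMessage();
        if (msg == null) {
            return false;
        }
        return msg.startsWith("Connection reset")   // assumed wording for a TCP RST
                || msg.startsWith("Broken pipe")    // assumed wording for a closed peer
                || msg.contains("java.nio.channels.SocketChannel");
    }

    public static void main(String[] args) {
        System.out.println(isNetworkRelated(new IOException("Connection reset by peer")));
        System.out.println(isNetworkRelated(new IOException("No space left on device")));
    }
}
```

Only exceptions for which a check like this returns false would go on to trigger the disk check.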

          I've verified the test on both Mac and Windows.
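
The effect of capping the failover delay (item 2 above) can be illustrated with a small calculation. This is a simplified sketch: the 500 ms base delay, the plain doubling rule, and the count of eight attempts are assumptions chosen for illustration, the class is hypothetical, and the real client retry policy may differ (for example, by adding random jitter).

```java
/** Hypothetical sketch: capped exponential backoff and its cumulative delay. */
final class CappedBackoff {
    private CappedBackoff() {}

    /** Delay before the given 0-based attempt: base * 2^attempt, capped at maxMillis. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        // Clamp the shift so large attempt counts cannot overflow the multiplication.
        int shift = Math.min(attempt, 30);
        return Math.min(maxMillis, baseMillis * (1L << shift));
    }

    /** Sum of the delays over the first `attempts` attempts. */
    static long totalMillis(int attempts, long baseMillis, long maxMillis) {
        long total = 0;
        for (int i = 0; i < attempts; i++) {
            total += delayMillis(i, baseMillis, maxMillis);
        }
        return total;
    }

    public static void main(String[] args) {
        // Eight attempts with an assumed 500 ms base delay.
        System.out.println(totalMillis(8, 500, 15_000)); // 15s cap: prints 60500
        System.out.println(totalMillis(8, 500, 1_000));  // 1s cap:  prints 7500
    }
}
```

Under these assumptions, eight failovers with a 15s cap spend about a minute sleeping, while a 1s cap keeps the total under eight seconds.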

          Kihwal Lee made changes -
          Field Original Value New Value
          Link This issue is related to HDFS-4663 [ HDFS-4663 ]
          Chris Nauroth created issue -

            People

            • Assignee: Chris Nauroth
            • Reporter: Chris Nauroth
            • Votes: 0
            • Watchers: 6
