Hadoop HDFS / HDFS-6617

Flake TestDFSZKFailoverController.testManualFailoverWithDFSHAAdmin due to a long edit log sync op

    Details

    • Type: Test
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 2.5.0
    • Fix Version/s: 2.6.0
    • Component/s: auto-failover, test
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      I just hit a false-alarm test failure while working on HDFS-6614; see https://builds.apache.org/job/PreCommit-HDFS-Build/7259//testReport/org.apache.hadoop.hdfs.server.namenode.ha/TestDFSZKFailoverController/testManualFailoverWithDFSHAAdmin/

      After looking at the log, the failure turned out to come from a timeout in
      ZKFailoverController.doCedeActive():
      localTarget.getProxy(conf, timeout).transitionToStandby(createReqInfo());

      The timeout occurred while the active services were being stopped; see FSNamesystem.stopActiveServices():
      void stopActiveServices() {
      LOG.info("Stopping services started for active state");
      ....
      This correlates with the log:
      "2014-07-01 08:12:50,615 INFO namenode.FSNamesystem (FSNamesystem.java:stopActiveServices(1167)) - Stopping services started for active state"

      stopActiveServices() then calls editLog.close(), which goes on to endCurrentLogSegment(); see the log:
      2014-07-01 08:12:50,616 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1216)) - Ending log segment 1

      This operation did not finish within 5 seconds, which triggered the timeout:

      2014-07-01 08:12:55,624 WARN ha.ZKFailoverController (ZKFailoverController.java:doCedeActive(577)) - Unable to transition local node to standby: Call From asf001.sp2.ygridcore.net/67.195.138.31 to localhost:10021 failed on socket timeout exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:53965 remote=localhost/127.0.0.1:10021]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout

      The logEdit/logSync finally completed, followed by printStatistics(true):
      2014-07-01 08:13:05,243 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(675)) - Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 3 SyncTimes(ms): 14667 74 105
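The gaps between the log lines quoted above can be checked with a little arithmetic. The sketch below is only an illustration: `LogTimeline` and `millisBetween` are hypothetical names, not part of Hadoop, and the three timestamps are copied from the log excerpts in this description.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper (not part of Hadoop) for measuring gaps between log lines. */
final class LogTimeline {
    // Timestamp layout used by the log lines quoted in this issue.
    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Milliseconds elapsed between two log timestamps. */
    static long millisBetween(String start, String end) {
        LocalDateTime from = LocalDateTime.parse(start, LOG_TIME);
        LocalDateTime to = LocalDateTime.parse(end, LOG_TIME);
        return Duration.between(from, to).toMillis();
    }

    public static void main(String[] args) {
        String segmentEnding = "2014-07-01 08:12:50,616"; // "Ending log segment 1"
        String rpcTimedOut = "2014-07-01 08:12:55,624";   // SocketTimeoutException
        String statsPrinted = "2014-07-01 08:13:05,243";  // printStatistics

        // Gap until the caller gave up: 5008 ms
        System.out.println(millisBetween(segmentEnding, rpcTimedOut));
        // Gap until the edit log statistics were printed: 14627 ms
        System.out.println(millisBetween(segmentEnding, statsPrinted));
    }
}
```

The first gap sits just past the 5000 ms timeout reported in the exception, while the segment took roughly 14.6 seconds to close.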

      So this long sync evidently contributed to the timeout (SyncTimes above reports 14667 ms, well over the 5000 ms limit); the QA box may have been very slow at that moment. One possible fix is to raise the fence timeout above its default.
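The failure mode described here, a caller with a fixed timeout waiting on an operation that takes longer, can be modelled outside Hadoop. This is a minimal sketch under stated assumptions: `CedeActiveSketch` and `completesWithin` are hypothetical names, the sleep merely stands in for the slow edit log sync, and the durations are scaled down from the 5000 ms and roughly 14.6 s seen in the log so the example runs quickly.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical model (not Hadoop code) of a caller giving up on a slow operation. */
final class CedeActiveSketch {
    /**
     * Runs an operation lasting operationMillis on another thread and waits at
     * most timeoutMillis for it, much as an RPC client waits for a reply.
     * Returns true if the reply arrived in time, false if the caller timed out.
     */
    static boolean completesWithin(long operationMillis, long timeoutMillis) {
        ExecutorService server = Executors.newSingleThreadExecutor();
        try {
            Future<?> reply = server.submit(() -> {
                Thread.sleep(operationMillis); // stands in for the slow edit log sync
                return null;
            });
            reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // the caller stops waiting before the operation finishes
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            server.shutdownNow(); // clean up the worker thread in this sketch
        }
    }

    public static void main(String[] args) {
        // Scaled-down values (assumptions): a 150 ms "sync" against a 50 ms timeout.
        System.out.println(completesWithin(150, 50));
        // The same "sync" against a much larger timeout.
        System.out.println(completesWithin(150, 2000));
    }
}
```

A larger timeout makes the caller tolerate the slow sync; it does not make the sync itself any faster.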

      1. HDFS-6617-v2.txt
        1 kB
        Liang Xie
      2. HDFS-6617.txt
        0.9 kB
        Liang Xie

        Activity

        Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1825 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1825/)
        HDFS-6617. Flake TestDFSZKFailoverController.testManualFailoverWithDFSHAAdmin due to a long edit log sync op. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1608522)

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDFSZKFailoverController.java
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk #1798 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1798/)
        HDFS-6617. Flake TestDFSZKFailoverController.testManualFailoverWithDFSHAAdmin due to a long edit log sync op. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1608522)

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDFSZKFailoverController.java
        Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #607 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/607/)
        HDFS-6617. Flake TestDFSZKFailoverController.testManualFailoverWithDFSHAAdmin due to a long edit log sync op. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1608522)

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDFSZKFailoverController.java
        Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #5832 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5832/)
        HDFS-6617. Flake TestDFSZKFailoverController.testManualFailoverWithDFSHAAdmin due to a long edit log sync op. Contributed by Liang Xie. (cnauroth: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1608522)

        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestDFSZKFailoverController.java
        Chris Nauroth added a comment -

        I committed this to trunk and branch-2. Liang Xie, thank you for contributing this patch.

        Chris Nauroth added a comment -

        +1 for the patch. I'll commit this.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12653799/HDFS-6617-v2.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7278//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7278//console

        This message is automatically generated.

        Liang Xie added a comment -

        Chris Nauroth, your suggestion is definitely better. I made a patch v2 and confirmed that the setting took effect by grepping for "FSEditLog.java:printStatistics" in the log files with and without the patch, so I am pretty sure it will fix the test failure caused by the slow edit log sync operation. Please help to review, thank you!

        Chris Nauroth added a comment -

        Hi, Liang Xie. Thanks for contributing this. I'm curious if putting the below code snippet into TestDFSZKFailoverController would fix the problem for you. This should avoid the disk latency for edit logging during test runs, so I'm curious if this is another way to fix the problem in your environment. This also might make the test run faster. If not, then your current patch is fine too.

          static {
            EditLogFileOutputStream.setShouldSkipFsyncForTesting(true);
          }
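To illustrate why a static test hook of this kind removes disk latency, here is a self-contained sketch. It is not Hadoop's EditLogFileOutputStream: `SketchEditLogStream`, `writeAndSync`, and `forcesFor` are hypothetical stand-ins, and the only assumption carried over from the comment is the idea of a process-wide flag that makes the stream skip its fsync.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical stand-in for an edit log stream; NOT Hadoop's EditLogFileOutputStream. */
final class SketchEditLogStream implements AutoCloseable {
    // Process-wide test hook: when true, writes skip the expensive force-to-disk call.
    private static volatile boolean skipFsyncForTesting = false;

    private final FileChannel channel;
    private int forceCalls = 0; // number of real fsyncs issued, for illustration

    static void setShouldSkipFsyncForTesting(boolean skip) {
        skipFsyncForTesting = skip;
    }

    SketchEditLogStream(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.WRITE);
    }

    /** Writes one record and syncs it, unless the test hook disables syncing. */
    void writeAndSync(String record) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8));
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        if (!skipFsyncForTesting) {
            channel.force(false); // blocks until the data reaches the storage device
            forceCalls++;
        }
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    /** Writes two records with the hook on or off and reports how many fsyncs ran. */
    static int forcesFor(boolean skipFsync) {
        setShouldSkipFsyncForTesting(skipFsync);
        try {
            Path file = Files.createTempFile("edits-sketch", ".log");
            try (SketchEditLogStream out = new SketchEditLogStream(file)) {
                out.writeAndSync("record-1\n");
                out.writeAndSync("record-2\n");
                return out.forceCalls;
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            setShouldSkipFsyncForTesting(false); // do not leak the hook to other callers
        }
    }

    public static void main(String[] args) {
        System.out.println(forcesFor(false)); // fsync after every record
        System.out.println(forcesFor(true));  // hook enabled: no fsync at all
    }
}
```

In a test class the flag would be set once in a static initializer, as in the snippet above, so that every stream created during the run skips the sync.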
        
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12653373/HDFS-6617.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7261//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7261//console

        This message is automatically generated.


          People

          • Assignee:
            Liang Xie
            Reporter:
            Liang Xie
          • Votes:
            0
            Watchers:
            4
