Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-1806

TestBlockReport.blockReport_08() and _09() are timing-dependent and likely to fail on fast servers

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0, 0.23.0
    • Fix Version/s: 0.22.0, 0.23.0
    • Component/s: datanode, namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Method waitForTempReplica() polls every 100ms during block replication, attempting to "catch" a datanode in the state of having a TEMPORARY replica. But examination of a current Hudson test failure log shows that the replica goes from "start" to "TEMPORARY" to "FINALIZED" in only 50ms, so of course the poll usually misses it.

      1. blockReport_08_failure_log.html
        41 kB
        Matt Foley
      2. TestBlockReport.java.patch
        1 kB
        Matt Foley

        Activity

        Hide
        Matt Foley added a comment -

        In the attached log excerpt, from Apache Hudson/Jenkins QA auto-test, replication starts at the datanode at 2:58:09,608.
        The datanode moves the replica from tmp to finalized at 2:58:09,659-660.
        Then at 2:58:09,663 we see the message "Replication state before the loop 0", which represents the START of polling – way too late.

        So both the waitTil(100) and waitTil(50) lines in waitForTempReplica() are too long.

        Show
        Matt Foley added a comment - In the attached log excerpt, from Apache Hudson/Jenkins QA auto-test, replication starts at the datanode at 2:58:09,608. The datanode moves the replica from tmp to finalized at 2:58:09,659-660. Then at 2:58:09,663 we see the message "Replication state before the loop 0", which represents the START of polling – way too late. So both the waitTil(100) and waitTil(50) lines in waitForTempReplica() are too long.
        Hide
        Konstantin Boudnik added a comment -

        Great analysis, Matt! Thanks for looking into this long-standing issue (I believe there's a JIRA for this already). So, if the second loop is merged with the first one and the poll time is shortened significantly (say, 5 ms?) would it solve the problem?

        Show
        Konstantin Boudnik added a comment - Great analysis, Matt! Thanks for looking into this long-standing issue (I believe there's a JIRA for this already). So, if the second loop is merged with the first one and the poll time is shortened significantly (say, 5 ms?) would it solve the problem?
        Hide
        Matt Foley added a comment -

        Hi Cos, I hope so. I'm going to try it with both loops polling at 5ms. If it gets through Hudson I'll tie in both HDFS-953 and HDFS-733.

        Show
        Matt Foley added a comment - Hi Cos, I hope so. I'm going to try it with both loops polling at 5ms. If it gets through Hudson I'll tie in both HDFS-953 and HDFS-733 .
        Hide
        Matt Foley added a comment -

        Seems to have done the trick. Posted here for information, but I'm going to subordinate this bug to HDFS-1295 (which it was blocking) and submit a single patch to that Jira.

        Show
        Matt Foley added a comment - Seems to have done the trick. Posted here for information, but I'm going to subordinate this bug to HDFS-1295 (which it was blocking) and submit a single patch to that Jira.
        Hide
        Konstantin Boudnik added a comment -

        Great finding, Matt! So, you are able to see the improvement with the patch on the same machine where intermittent failure was observed before, right?

        Show
        Konstantin Boudnik added a comment - Great finding, Matt! So, you are able to see the improvement with the patch on the same machine where intermittent failure was observed before, right?
        Hide
        Matt Foley added a comment -

        Well, strictly speaking since it is an intermittent failure, we'll actually have to wait and see

        But yes, I see improvement on the same machine and in the same test circumstances where pretty consistent intermittent failure was previously observed:

        HDFS-1295 just passed auto-test without a TestBlockReport failure; whereas it had failed previously, and indeed it has been a long time since I saw TestBlockReport NOT be on the Hudson failure list for several HDFS bugs I was interested in.

        Altho this is a single instance, I think we've got rational grounds for believing that we now understand the probable cause and have addressed it. Regrettably I can't test the failure locally – it doesn't fail on any machine I've tried it on here. The proof of the pudding will be seeing if the community stops being bothered by this false positive after it is submitted to the public code base.

        Show
        Matt Foley added a comment - Well, strictly speaking since it is an intermittent failure, we'll actually have to wait and see But yes, I see improvement on the same machine and in the same test circumstances where pretty consistent intermittent failure was previously observed: HDFS-1295 just passed auto-test without a TestBlockReport failure; whereas it had failed previously, and indeed it has been a long time since I saw TestBlockReport NOT be on the Hudson failure list for several HDFS bugs I was interested in. Altho this is a single instance, I think we've got rational grounds for believing that we now understand the probable cause and have addressed it. Regrettably I can't test the failure locally – it doesn't fail on any machine I've tried it on here. The proof of the pudding will be seeing if the community stops being bothered by this false positive after it is submitted to the public code base.
        Hide
        Konstantin Boudnik added a comment -

        Regrettably I can't test the failure locally – it doesn't fail on any machine I've tried it on here.

        I know. That has been a standing issue because the problem wasn't surfacing pretty much anywhere but on ASF Hudson for one reason or another. Thanks again!

        Show
        Konstantin Boudnik added a comment - Regrettably I can't test the failure locally – it doesn't fail on any machine I've tried it on here. I know. That has been a standing issue because the problem wasn't surfacing pretty much anywhere but on ASF Hudson for one reason or another. Thanks again!
        Hide
        Konstantin Boudnik added a comment -

        +1 patch looks good.
        Matt, please let me know if you need help committing it.

        Show
        Konstantin Boudnik added a comment - +1 patch looks good. Matt, please let me know if you need help committing it.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12476071/TestBlockReport.java.patch
        against trunk revision 1092534.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.TestLargeBlock

        -1 contrib tests. The patch failed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/369//testReport/
        Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/369//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/369//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12476071/TestBlockReport.java.patch against trunk revision 1092534. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.TestLargeBlock -1 contrib tests. The patch failed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/369//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/369//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/369//console This message is automatically generated.
        Hide
        Matt Foley added a comment -

        These failures are unrelated to the patch, which only modifies TestBlockReport.

        Show
        Matt Foley added a comment - These failures are unrelated to the patch, which only modifies TestBlockReport.
        Hide
        Eli Collins added a comment -

        +1 Change lgtm.

        TestBlockReport no longer fails while looping it for several hours after applying this patch.

        Show
        Eli Collins added a comment - +1 Change lgtm. TestBlockReport no longer fails while looping it for several hours after applying this patch.
        Hide
        Eli Collins added a comment -

        I've committed this to branch 22 and trunk. Thanks Matt!

        Show
        Eli Collins added a comment - I've committed this to branch 22 and trunk. Thanks Matt!
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/)
        HDFS-1806. TestBlockReport.blockReport_08() and _09() are timing-dependent and likely to fail on fast servers. Contributed by Matt Foley.

        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/ ) HDFS-1806 . TestBlockReport.blockReport_08() and _09() are timing-dependent and likely to fail on fast servers. Contributed by Matt Foley.
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #600 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/600/)

        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #600 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/600/ )
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-22-branch #35 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/35/)
        HDFS-1806. svn merge -c 1095235 from trunk

        Show
        Hudson added a comment - Integrated in Hadoop-Hdfs-22-branch #35 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/35/ ) HDFS-1806 . svn merge -c 1095235 from trunk

          People

          • Assignee:
            Matt Foley
            Reporter:
            Matt Foley
          • Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development