Hadoop HDFS
  1. Hadoop HDFS
  2. HDFS-1829

TestNodeCount waits forever, errs without giving information

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In three locations in the code, TestNodeCount waits forever on a condition. Failures result in Hudson/Jenkins "Timeout occurred" error message with no information about where or why. Need to replace with TimeoutExceptions that throw a stack trace and useful info about the failure mode.

      Also investigate possible cause of failure.

      1. 1829_TestNodeCount_v4.patch
        4 kB
        Matt Foley
      2. 1829_TestNodeCount_v4.patch
        4 kB
        Matt Foley
      3. TestNodeCount_v2.patch
        3 kB
        Matt Foley
      4. TestNodeCount.java.patch
        3 kB
        Matt Foley

        Issue Links

          Activity

          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #650 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/650/)

          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #650 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/650/ )
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #609 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/609/)
          HDFS-1829. TestNodeCount waits forever, errs without giving information. Contributed by Matt Foley

          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #609 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/609/ ) HDFS-1829 . TestNodeCount waits forever, errs without giving information. Contributed by Matt Foley
          Hide
          Eli Collins added a comment -

          I've committed this. Thanks Matt!

          Show
          Eli Collins added a comment - I've committed this. Thanks Matt!
          Hide
          Matt Foley added a comment -

          Neither of these failed core tests are related to this patch. No methods of TestNodeCount are called outside of this unit test.

          Ready for commit.

          Show
          Matt Foley added a comment - Neither of these failed core tests are related to this patch. No methods of TestNodeCount are called outside of this unit test. Ready for commit.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12477065/1829_TestNodeCount_v4.patch
          against trunk revision 1095830.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/407//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/407//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/407//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12477065/1829_TestNodeCount_v4.patch against trunk revision 1095830. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.cli.TestHDFSCLI org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery +1 contrib tests. The patch passed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/407//testReport/ Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/407//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/407//console This message is automatically generated.
          Hide
          Matt Foley added a comment -

          re-upload to trigger Hudson

          Show
          Matt Foley added a comment - re-upload to trigger Hudson
          Hide
          Matt Foley added a comment -

          Thanks, Eli. Opened HDFS-1853 for the suggested refactoring. Also opened HDFS-1852 as an umbrella task for gathering and discussing many possible such improvements.

          Show
          Matt Foley added a comment - Thanks, Eli. Opened HDFS-1853 for the suggested refactoring. Also opened HDFS-1852 as an umbrella task for gathering and discussing many possible such improvements.
          Hide
          Eli Collins added a comment -

          Eli, can we open another Jira for that?

          Absolutely. What you have here is the right direction, we can always refactor the wait to DFSTestUtil later.

          +1 on 1829_TestNodeCount_v4.patch

          Show
          Eli Collins added a comment - Eli, can we open another Jira for that? Absolutely. What you have here is the right direction, we can always refactor the wait to DFSTestUtil later. +1 on 1829_TestNodeCount_v4.patch
          Hide
          Matt Foley added a comment -

          Cos, this patch only modifies TestNodeCount, and TestNodeCount and its methods are not called from outside that unit test module, so other tests failing have nothing to do with this test. However, I am waiting for another run of Hudson and will respond to its results before asking for a commit. Thanks.

          Show
          Matt Foley added a comment - Cos, this patch only modifies TestNodeCount, and TestNodeCount and its methods are not called from outside that unit test module, so other tests failing have nothing to do with this test. However, I am waiting for another run of Hudson and will respond to its results before asking for a commit. Thanks.
          Hide
          Matt Foley added a comment -

          Eli, can we open another Jira for that? My goal here was only to get rid of the recurring false positive in Hudson for this unit test, which this patch does in a clean way. Your suggestion would be a good contribution to the toolkit (and I have been trying to do that along the way, as you know), but I've already spent more time than I should on this issue.

          Show
          Matt Foley added a comment - Eli, can we open another Jira for that? My goal here was only to get rid of the recurring false positive in Hudson for this unit test, which this patch does in a clean way. Your suggestion would be a good contribution to the toolkit (and I have been trying to do that along the way, as you know), but I've already spent more time than I should on this issue.
          Hide
          Eli Collins added a comment -

          How about handling these as is done for HDFS-1562? You could augment NameNodeAdapter#getReplicaInfo to return excess and live replica counts as well and then just add waitFor[Live|Excess]Replicas methods to DFSTestUtil and have TestNodeCount call them. This way we could re-use them in the other replication tests.

          Show
          Eli Collins added a comment - How about handling these as is done for HDFS-1562 ? You could augment NameNodeAdapter#getReplicaInfo to return excess and live replica counts as well and then just add waitFor [Live|Excess] Replicas methods to DFSTestUtil and have TestNodeCount call them. This way we could re-use them in the other replication tests.
          Hide
          Konstantin Boudnik added a comment -

          Ah, 'better simpler than clever'! You've avoided callables all together yet the code doesn't have original dups. I like it.

          +1 patch looks good.

          What about failing tests? Are they related?

          Show
          Konstantin Boudnik added a comment - Ah, 'better simpler than clever'! You've avoided callables all together yet the code doesn't have original dups. I like it. +1 patch looks good. What about failing tests? Are they related?
          Hide
          Matt Foley added a comment -

          I had originally rejected trying to factor out those similar-but-different loops, because using callback objects would have increased the code complexity more than was worthwhile. However, you pushed me to look at it again, and there's a better solution Please see if you agree.

          Show
          Matt Foley added a comment - I had originally rejected trying to factor out those similar-but-different loops, because using callback objects would have increased the code complexity more than was worthwhile. However, you pushed me to look at it again, and there's a better solution Please see if you agree.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12476383/TestNodeCount_v2.patch
          against trunk revision 1092534.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.datanode.TestBlockReport
          org.apache.hadoop.hdfs.TestFileAppend4
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.hdfs.TestLargeBlock
          org.apache.hadoop.hdfs.TestWriteConfigurationToDFS

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/370//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/370//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/370//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12476383/TestNodeCount_v2.patch against trunk revision 1092534. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.server.datanode.TestBlockReport org.apache.hadoop.hdfs.TestFileAppend4 org.apache.hadoop.hdfs.TestFileConcurrentReader org.apache.hadoop.hdfs.TestLargeBlock org.apache.hadoop.hdfs.TestWriteConfigurationToDFS -1 contrib tests. The patch failed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/370//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/370//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/370//console This message is automatically generated.
          Hide
          Konstantin Boudnik added a comment -

          I think it makes sense to have a new private method instead of copying pretty much the same code 3 times. Surely the iterators in while loops are different methods but I think you can parametrized it via Callable.

          Show
          Konstantin Boudnik added a comment - I think it makes sense to have a new private method instead of copying pretty much the same code 3 times. Surely the iterators in while loops are different methods but I think you can parametrized it via Callable.
          Hide
          Matt Foley added a comment -

          This patch has correct synchronization using readLock().

          Show
          Matt Foley added a comment - This patch has correct synchronization using readLock().
          Hide
          Matt Foley added a comment -

          In test build 366, we see TestNodeCount fail with NullPointerException in BlockManager. (ref https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//testReport/org.apache.hadoop.hdfs.server.namenode/TestNodeCount/testNodeCount/ )

          Such an error certainly can be caused by lack of synchronization. This patch [ 12476098 ] should address it.

          Show
          Matt Foley added a comment - In test build 366, we see TestNodeCount fail with NullPointerException in BlockManager. (ref https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//testReport/org.apache.hadoop.hdfs.server.namenode/TestNodeCount/testNodeCount/ ) Such an error certainly can be caused by lack of synchronization. This patch [ 12476098 ] should address it.
          Hide
          Matt Foley added a comment -

          Converted the infinite waits to 20 second timeouts with informative TimeoutException message info.

          I did not find an obvious explanation for the failure, but did find that only one of three calls to namesystem.blockManager.countNodes(block) was correctly synchronized on namesystem. Fixed the other two, following the same pattern. Although, in retrospect, all three really should be replaced by readLock() calls. I'll fix that in the next version.

          Show
          Matt Foley added a comment - Converted the infinite waits to 20 second timeouts with informative TimeoutException message info. I did not find an obvious explanation for the failure, but did find that only one of three calls to namesystem.blockManager.countNodes(block) was correctly synchronized on namesystem. Fixed the other two, following the same pattern. Although, in retrospect, all three really should be replaced by readLock() calls. I'll fix that in the next version.

            People

            • Assignee:
              Matt Foley
              Reporter:
              Matt Foley
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development