Hadoop HDFS: HDFS-1829

TestNodeCount waits forever, errs without giving information

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In three locations in the code, TestNodeCount waits forever on a condition. Failures result in a Hudson/Jenkins "Timeout occurred" error message with no information about where or why. These waits need to be replaced with TimeoutExceptions that carry a stack trace and useful info about the failure mode.

      Also investigate possible cause of failure.
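
For illustration, the replacement described above (a bounded wait that fails with a descriptive TimeoutException instead of spinning forever) might look roughly like the following. This is a minimal sketch, not the code from the attached patches: the class `BoundedWait`, the method `waitForCount`, and its parameters are hypothetical names, and it assumes Java 8 or later for `IntSupplier`.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.IntSupplier;

/** Hypothetical helper for illustration; not taken from the HDFS-1829 patches. */
final class BoundedWait {
    private BoundedWait() {}

    /**
     * Polls {@code current} until it returns {@code expected}. If the deadline
     * passes first, throws a TimeoutException whose message names the condition,
     * the expected value, and the last value observed.
     */
    static void waitForCount(String what, IntSupplier current, int expected,
                             long timeoutMillis, long pollMillis)
            throws TimeoutException, InterruptedException {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        int last = current.getAsInt();
        while (last != expected) {
            // Overflow-safe check that the deadline has been reached.
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("Timed out after " + timeoutMillis
                        + " ms waiting for " + what + ": expected " + expected
                        + ", last observed " + last);
            }
            Thread.sleep(pollMillis);
            last = current.getAsInt();
        }
    }
}
```

With a helper of this shape, a hung condition surfaces as an exception with a stack trace pointing at the specific wait, rather than as a build-level "Timeout occurred" message.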

      1. TestNodeCount.java.patch
        3 kB
        Matt Foley
      2. TestNodeCount_v2.patch
        3 kB
        Matt Foley
      3. 1829_TestNodeCount_v4.patch
        4 kB
        Matt Foley
      4. 1829_TestNodeCount_v4.patch
        4 kB
        Matt Foley

          Activity

          Transition                    Time In Source Status   Execution Times   Last Executer   Last Execution Date
          Open -> Patch Available       2d 18h 5m               1                 Matt Foley      14/Apr/11 23:53
          Patch Available -> Resolved   12d 6h 39m              1                 Eli Collins     27/Apr/11 06:33
          Resolved -> Closed            201d 19h 20m            1                 Arun C Murthy   15/Nov/11 00:53
          Arun C Murthy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Konstantin Boudnik made changes -
          Link This issue is related to HDFS-2451 [ HDFS-2451 ]
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #650 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/650/)

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #609 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/609/)
          HDFS-1829. TestNodeCount waits forever, errs without giving information. Contributed by Matt Foley

          Eli Collins made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Resolution Fixed [ 1 ]
          Eli Collins added a comment -

          I've committed this. Thanks Matt!

          Matt Foley added a comment -

          Neither of these failed core tests is related to this patch. No methods of TestNodeCount are called outside of this unit test.

          Ready for commit.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12477065/1829_TestNodeCount_v4.patch
          against trunk revision 1095830.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.cli.TestHDFSCLI
          org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery

          +1 contrib tests. The patch passed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/407//testReport/
          Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/407//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/407//console

          This message is automatically generated.

          Matt Foley made changes -
          Attachment 1829_TestNodeCount_v4.patch [ 12477065 ]
          Matt Foley added a comment -

          re-upload to trigger Hudson

          Matt Foley added a comment -

          Thanks, Eli. Opened HDFS-1853 for the suggested refactoring. Also opened HDFS-1852 as an umbrella task for gathering and discussing many possible such improvements.

          Eli Collins added a comment -

          "Eli, can we open another Jira for that?"

          Absolutely. What you have here is the right direction, we can always refactor the wait to DFSTestUtil later.

          +1 on 1829_TestNodeCount_v4.patch

          Matt Foley added a comment -

          Cos, this patch only modifies TestNodeCount, and TestNodeCount and its methods are not called from outside that unit test module, so other tests failing have nothing to do with this test. However, I am waiting for another run of Hudson and will respond to its results before asking for a commit. Thanks.

          Matt Foley added a comment -

          Eli, can we open another Jira for that? My goal here was only to get rid of the recurring false positive in Hudson for this unit test, which this patch does in a clean way. Your suggestion would be a good contribution to the toolkit (and I have been trying to do that along the way, as you know), but I've already spent more time than I should on this issue.

          Eli Collins added a comment -

          How about handling these as is done for HDFS-1562? You could augment NameNodeAdapter#getReplicaInfo to return excess and live replica counts as well and then just add waitFor[Live|Excess]Replicas methods to DFSTestUtil and have TestNodeCount call them. This way we could re-use them in the other replication tests.

          Konstantin Boudnik added a comment -

          Ah, 'better simpler than clever'! You've avoided callables altogether, yet the code doesn't have the original duplication. I like it.

          +1 patch looks good.

          What about failing tests? Are they related?

          Matt Foley made changes -
          Link This issue is part of HDFS-1295 [ HDFS-1295 ]
          Matt Foley made changes -
          Attachment 1829_TestNodeCount_v4.patch [ 12476542 ]
          Matt Foley added a comment -

          I had originally rejected trying to factor out those similar-but-different loops, because using callback objects would have increased the code complexity more than was worthwhile. However, you pushed me to look at it again, and there's a better solution. Please see if you agree.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12476383/TestNodeCount_v2.patch
          against trunk revision 1092534.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.datanode.TestBlockReport
          org.apache.hadoop.hdfs.TestFileAppend4
          org.apache.hadoop.hdfs.TestFileConcurrentReader
          org.apache.hadoop.hdfs.TestLargeBlock
          org.apache.hadoop.hdfs.TestWriteConfigurationToDFS

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/370//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/370//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/370//console

          This message is automatically generated.

          Konstantin Boudnik added a comment -

          I think it makes sense to have a new private method instead of copying pretty much the same code three times. Sure, the conditions checked in the while loops are different methods, but I think you can parameterize them via Callable.
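
A rough sketch of what such a Callable-parameterized helper could look like is given here for illustration. It is an assumption about the shape of the idea, not code from any attached patch (the later v4 patch is described in this thread as avoiding callables); `CallableWait` and `waitUntilEquals` are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: one wait loop shared by several different probes. */
final class CallableWait {
    private CallableWait() {}

    /**
     * Repeatedly invokes {@code probe} until its result equals {@code expected},
     * then returns that result. Throws TimeoutException with a descriptive
     * message if the deadline is reached first.
     */
    static <T> T waitUntilEquals(String what, Callable<T> probe, T expected,
                                 long timeoutMillis, long pollMillis)
            throws Exception {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            T observed = probe.call();
            if (expected.equals(observed)) {
                return observed;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException("Timed out after " + timeoutMillis
                        + " ms waiting for " + what + ": expected " + expected
                        + ", last observed " + observed);
            }
            Thread.sleep(pollMillis);
        }
    }
}
```

Each of the three loops would then differ only in the Callable it passes in, at the cost of one anonymous class (or lambda) per call site.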

          Matt Foley made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Matt Foley made changes -
          Attachment TestNodeCount_v2.patch [ 12476383 ]
          Matt Foley added a comment -

          This patch has correct synchronization using readLock().

          Matt Foley added a comment -

          In test build 366, we see TestNodeCount fail with NullPointerException in BlockManager. (ref https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//testReport/org.apache.hadoop.hdfs.server.namenode/TestNodeCount/testNodeCount/ )

          Such an error certainly can be caused by lack of synchronization. This patch [ 12476098 ] should address it.

          Matt Foley made changes -
          Link This issue is part of HDFS-1295 [ HDFS-1295 ]
          Matt Foley made changes -
          Attachment TestNodeCount.java.patch [ 12476098 ]
          Matt Foley added a comment -

          Converted the infinite waits to 20 second timeouts with informative TimeoutException message info.

          I did not find an obvious explanation for the failure, but did find that only one of three calls to namesystem.blockManager.countNodes(block) was correctly synchronized on namesystem. Fixed the other two, following the same pattern. Although, in retrospect, all three really should be replaced by readLock() calls. I'll fix that in the next version.
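
To illustrate the locking point, here is a minimal, self-contained sketch of reading a count under a read lock rather than synchronizing on the whole object. `ToyNamesystem`, `LockedProbe`, and all of their methods are hypothetical stand-ins built on `java.util.concurrent.locks.ReentrantReadWriteLock`; they are not the real namesystem or BlockManager classes, whose internals are not shown in this issue.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a namesystem that guards block state with a read/write lock. */
final class ToyNamesystem {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, Integer> liveReplicas = new HashMap<>();

    void readLock()   { lock.readLock().lock(); }
    void readUnlock() { lock.readLock().unlock(); }

    /** Writers take the write lock, so readers never see a half-applied update. */
    void setLive(String block, int count) {
        lock.writeLock().lock();
        try {
            liveReplicas.put(block, count);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Caller must hold the read lock (or the write lock). */
    int countLiveUnlocked(String block) {
        Integer n = liveReplicas.get(block);
        return n == null ? 0 : n;
    }
}

final class LockedProbe {
    private LockedProbe() {}

    /** Reads the count under the read lock, releasing it even if the read throws. */
    static int countLive(ToyNamesystem ns, String block) {
        ns.readLock();
        try {
            return ns.countLiveUnlocked(block);
        } finally {
            ns.readUnlock();
        }
    }
}
```

The try/finally pairing matters in a test: if the probe throws, the lock is still released, so a failure does not turn into a second hang.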

          Matt Foley made changes -
          Field: Description
          Original Value:
          In three locations in the code, TestBalancer waits forever on a condition. Failures result in Hudson/Jenkins "Timeout occurred" error message with no information about where or why. Need to replace with TimeoutExceptions that throw a stack trace and useful info about the failure mode.

          Also investigate possible cause of failure.
          New Value:
          In three locations in the code, TestNodeCount waits forever on a condition. Failures result in Hudson/Jenkins "Timeout occurred" error message with no information about where or why. Need to replace with TimeoutExceptions that throw a stack trace and useful info about the failure mode.

          Also investigate possible cause of failure.
          Matt Foley created issue -

            People

            • Assignee:
              Matt Foley
              Reporter:
              Matt Foley
            • Votes:
              0
              Watchers:
              0
