Hadoop HDFS / HDFS-540

TestNameNodeMetrics fails intermittently

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.21.0
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Tags:
      ygridqa

      Description

      TestNameNodeMetrics has a strict timing constraint that relies on block management functionality, so it can fail intermittently.

      1. HDFS-540.patch
        8 kB
        Suresh Srinivas
      2. HDFS-540.patch
        8 kB
        Suresh Srinivas

          Activity

          gary murry added a comment -

          We are still seeing this: http://hudson.zones.apache.org/hudson/view/Hadoop/job/Hadoop-Hdfs-trunk/57/testReport/org.apache.hadoop.hdfs.server.namenode.metrics/TestNameNodeMetrics/testCorruptBlock/

          Suresh Srinivas added a comment -

          I will work on this next week

          Suresh Srinivas added a comment -

          The previous test expected the pendingReplicationBlocks and scheduledReplicationBlocks metrics to be incremented by 1 when a block is corrupted. The timing of these updates varies because it depends on replication processing and interaction between nodes. The patch removes the checks on pendingReplicationBlocks and scheduledReplicationBlocks, since this timing uncertainty could trigger intermittent test failures.

          Additional changes:

          1. Instead of starting and shutting down the cluster in the test's setUp() and tearDown(), each test starts its own cluster instance. This lets each test start the cluster with the number of nodes it requires (instead of every test having to use the same setup).
          2. The files created in each test are now named after the test, which makes failures easier to debug.
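
The timing dependence described in this comment can be illustrated with a small self-contained sketch. It is a toy model, not HDFS code: the class AsyncMetricSketch, the corruptBlocks counter, the latch, and runScenario() are hypothetical names chosen for illustration, and only the pendingReplicationBlocks name is borrowed from the discussion. One counter is updated inline by the caller and the other by a background thread, so a check made right after the call can observe the second counter before it has moved.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model (not HDFS code): one metric is updated inline, the other by a background thread. */
class AsyncMetricSketch {
    // Hypothetical metric updated synchronously on the caller's thread.
    final AtomicInteger corruptBlocks = new AtomicInteger();
    // Metric updated later by a background "replication" thread.
    final AtomicInteger pendingReplicationBlocks = new AtomicInteger();
    // Illustration hook: holds the background thread until it is released.
    final CountDownLatch workerMayRun = new CountDownLatch(1);

    /** Marks a block corrupt and hands the replication bookkeeping to a background thread. */
    Thread markBlockCorrupt() {
        corruptBlocks.incrementAndGet();
        Thread worker = new Thread(() -> {
            try {
                workerMayRun.await();
                pendingReplicationBlocks.incrementAndGet();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        return worker;
    }

    /** Runs the scenario once and returns {corrupt, pendingBefore, pendingAfter}. */
    static int[] runScenario() {
        AsyncMetricSketch model = new AsyncMetricSketch();
        Thread worker = model.markBlockCorrupt();
        // Read both metrics while the background thread is still held back.
        int corrupt = model.corruptBlocks.get();
        int pendingBefore = model.pendingReplicationBlocks.get();
        // Release the background thread and wait for it to finish.
        model.workerMayRun.countDown();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] {corrupt, pendingBefore, model.pendingReplicationBlocks.get()};
    }

    public static void main(String[] args) {
        int[] r = runScenario();
        System.out.println("corrupt=" + r[0] + " pendingBefore=" + r[1] + " pendingAfter=" + r[2]);
    }
}
```

In this sketch the latch forces the unlucky ordering on every run. In a real cluster the ordering depends on scheduling and on communication between nodes, which is why an immediate assertion on the asynchronously updated metric passes on some runs and fails on others.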
          Suresh Srinivas added a comment -

          Updated patch deletes an unused import.

          Kan Zhang added a comment -

          I think the test failed because it expects the pendingReplicationBlocks and scheduledReplicationBlocks metrics to drop from 1 to 0 when the block is deleted. However, when a block is deleted, only the corruptReplicas data structure is updated synchronously; pendingReplicationBlocks is not. If pendingReplicationBlocks were also updated synchronously, verification of both the pendingReplicationBlocks and scheduledReplicationBlocks metrics should succeed, since they would be updated at the same time as the corruptReplicas metric. The question is whether we want to update the pendingReplicationBlocks data structure synchronously. Can someone familiar with FSNamesystem comment on this?
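
For comparison, a test can also tolerate an asynchronously updated metric by polling it with a timeout instead of asserting immediately. This is not what the attached patch does (the patch removes the checks), and it is not an HDFS API: MetricPollSketch and waitForMetric are hypothetical names, and the 10 ms poll interval is an arbitrary assumption. A minimal sketch:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

/** Hypothetical helper (not part of HDFS): poll a metric until it reaches a value or time runs out. */
class MetricPollSketch {
    /** Returns true if the metric equals expected before timeoutMillis elapses, false otherwise. */
    static boolean waitForMetric(IntSupplier metric, int expected, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (metric.getAsInt() == expected) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10); // arbitrary poll interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a metric that a background thread clears after a short delay.
        AtomicInteger pending = new AtomicInteger(1);
        Thread background = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            pending.set(0);
        });
        background.start();
        System.out.println("reached 0: " + waitForMetric(pending::get, 0, 5000));
    }
}
```

The trade-off is that a polling check still needs a timeout that is generous enough for a slow build machine, so it reduces intermittent failures rather than ruling them out.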

          Kan Zhang added a comment -

          I don't see the reason why pendingReplicationBlocks needs to be updated synchronously. If so, +1 for the patch.

          Tsz Wo Nicholas Sze added a comment -

          Got a different failure case:

          Testcase: testCorruptBlock took 2.684 sec
          	FAILED
          expected:<1> but was:<0>
          junit.framework.AssertionFailedError: expected:<1> but was:<0>
          	at org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics.testCorruptBlock(TestNameNodeMetrics.java:114)
          

          The following is the one Gary mentioned earlier.

          expected:<0> but was:<1>
          junit.framework.AssertionFailedError: expected:<0> but was:<1>
          	at org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics.testCorruptBlock(TestNameNodeMetrics.java:118)
          
          Suresh Srinivas added a comment -

          This test has not failed for a long time. I will create a new jira if the failure occurs again with the patch attached to this jira applied.


            People

            • Assignee:
              Suresh Srinivas
              Reporter:
              Suresh Srinivas
            • Votes:
              0
              Watchers:
              4
