Hadoop HDFS / HDFS-12279

TestPipelinesFailover#testPipelineRecoveryStress fails due to race condition

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: namenode, test
    • Labels: None

    Description

      Saw a test failure in a precommit build:
      https://builds.apache.org/job/PreCommit-HDFS-Build/20600/testReport/org.apache.hadoop.hdfs.server.namenode.ha/TestPipelinesFailover/testPipelineRecoveryStress/

      Error Message
      
      Deferred
      Stacktrace
      
      java.lang.RuntimeException: Deferred
      	at org.apache.hadoop.test.MultithreadedTestUtil$TestContext.checkException(MultithreadedTestUtil.java:130)
      	at org.apache.hadoop.test.MultithreadedTestUtil$TestContext.stop(MultithreadedTestUtil.java:166)
      	at org.apache.hadoop.hdfs.server.namenode.ha.HAStressTestHarness.shutdown(HAStressTestHarness.java:154)
      	at org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover.testPipelineRecoveryStress(TestPipelinesFailover.java:493)
      Caused by: java.lang.AssertionError: null
      	at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.addBlocksToBeInvalidated(DatanodeDescriptor.java:641)
      	at org.apache.hadoop.hdfs.server.blockmanagement.InvalidateBlocks.invalidateWork(InvalidateBlocks.java:299)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.invalidateWorkForOneNode(BlockManager.java:4236)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeInvalidateWork(BlockManager.java:1736)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManagerTestUtil.computeInvalidationWork(BlockManagerTestUtil.java:169)
      	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManagerTestUtil.computeAllPendingWork(BlockManagerTestUtil.java:185)
      	at org.apache.hadoop.hdfs.server.namenode.ha.HAStressTestHarness$1.doAnAction(HAStressTestHarness.java:102)
      	at org.apache.hadoop.test.MultithreadedTestUtil$RepeatingTestThread.doWork(MultithreadedTestUtil.java:222)
      	at org.apache.hadoop.test.MultithreadedTestUtil$TestingThread.run(MultithreadedTestUtil.java:189)
      

      From reading the code, the assertion can only fail because of a race condition that occurs only in the test.

      Specifically, the test uses BlockManagerTestUtil to call BlockManager#computeInvalidateWork, which obtains the list of DataNodes from invalidateBlocks.getDatanodes(). It then uses that list to perform block invalidation via InvalidateBlocks#invalidateWork, which calls DatanodeDescriptor#addBlocksToBeInvalidated; that method asserts that the invalidation list is not empty. However, if the BlockManager itself performs block invalidation after the list is obtained but before DatanodeDescriptor#addBlocksToBeInvalidated is called, the invalidation list can already be empty, because no lock makes the two steps atomic.

      This is not a problem on a real cluster, because there is only one BlockManager per NameNode process, so the potential race condition is not exposed.
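
      The check-then-act pattern described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: the class InvalidationRaceSketch and its methods are hypothetical stand-ins, not the HDFS classes, and the problematic interleaving is replayed on a single thread so the result is deterministic. It shows a caller taking a snapshot of the nodes with pending work, a second caller draining that work, and the first caller then finding an empty list, which is the state the assertion in the stack trace rejects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of a pending-invalidation map; not the HDFS code. */
public class InvalidationRaceSketch {
    private final Map<String, List<Long>> pending = new HashMap<>();

    /** Queue one block for invalidation on the given node. */
    public synchronized void add(String node, long blockId) {
        pending.computeIfAbsent(node, k -> new ArrayList<>()).add(blockId);
    }

    /** Step 1: snapshot the nodes that currently have pending work. */
    public synchronized List<String> getDatanodes() {
        return new ArrayList<>(pending.keySet());
    }

    /** Step 2: remove and return the pending blocks for one node (may be empty). */
    public synchronized List<Long> pollBlocks(String node) {
        List<Long> blocks = pending.remove(node);
        return blocks == null ? new ArrayList<>() : blocks;
    }

    /** Replays the racy interleaving: another caller drains the node between steps. */
    public static int blocksSeenByLateCaller() {
        InvalidationRaceSketch q = new InvalidationRaceSketch();
        q.add("dn1", 1001L);
        List<String> snapshot = q.getDatanodes();    // caller A: step 1
        q.pollBlocks("dn1");                         // caller B runs in between
        return q.pollBlocks(snapshot.get(0)).size(); // caller A: step 2 finds nothing
    }

    /** One possible guard: hold the same lock across both steps. */
    public static int blocksSeenWithLock() {
        InvalidationRaceSketch q = new InvalidationRaceSketch();
        q.add("dn1", 1001L);
        synchronized (q) {                           // no other caller can interleave
            List<String> snapshot = q.getDatanodes();
            return q.pollBlocks(snapshot.get(0)).size();
        }
    }

    public static void main(String[] args) {
        System.out.println("late caller sees " + blocksSeenByLateCaller() + " block(s)");
        System.out.println("locked caller sees " + blocksSeenWithLock() + " block(s)");
    }
}
```

      Holding one lock across both steps is only one option in this model; tolerating an empty list in the second step would be another. Which, if either, fits the actual test harness is not established by this report.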

          People

            Assignee: Unassigned
            Reporter: Wei-Chiu Chuang (weichiu)
            Votes: 0
            Watchers: 3
