  1. Apache Ozone
  2. HDDS-3698 Ozone Non-Rolling upgrades
  3. HDDS-5336

Fix datanode capacity related race condition


Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed

    Description

      After merging master into the upgrade branch in HDDS-5321, an intermittent failure was noticed in TestSCMNodeManager#testLayoutOnHeartbeat: https://github.com/apache/ozone/runs/2787582345

      The issue occurs in SCMNodeManager#register: the node is added to the nodeStateManager, which fires the NEW_NODE event, before the node report containing the new node's storage information is processed. The event triggers a one-shot run of the background pipeline creator, which reads the node's storage information to determine whether the node can hold a pipeline. If the storage report has not yet been processed at that point, the node still appears to have no free space, so no pipeline is created to use it when it registers.
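
      The ordering problem can be illustrated with a small, self-contained model. This is only a sketch, not the actual SCMNodeManager code: the class RegistrationOrderSketch, the Scm stand-in, and all method names are hypothetical, and the event is delivered synchronously so that the bad ordering is reproduced every time (in the real system the event is handled on another thread, which is why the failure is intermittent). Processing the storage report before firing the event is shown as one plausible ordering that avoids the problem; the description above does not state how the issue was actually fixed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RegistrationOrderSketch {
  // Container size taken from the log line in this issue.
  static final long CONTAINER_SIZE = 5368709120L;

  /** Hypothetical, heavily simplified stand-in for SCM node and pipeline state. */
  static class Scm {
    final Map<String, Long> freeSpace = new HashMap<>();
    final List<String> nodes = new ArrayList<>();
    int factorThreePipelines = 0;

    /** Stand-in for the one-shot pipeline creator run triggered by NEW_NODE. */
    void onNewNode() {
      long viable = nodes.stream()
          .filter(n -> freeSpace.getOrDefault(n, 0L) >= CONTAINER_SIZE)
          .count();
      if (viable >= 3) {
        factorThreePipelines++;
      }
    }

    /**
     * Registers a node. When reportFirst is false, the event fires before the
     * storage report is recorded, mirroring the ordering described above.
     */
    void register(String nodeId, long reportedFree, boolean reportFirst) {
      if (reportFirst) {
        freeSpace.put(nodeId, reportedFree);
      }
      nodes.add(nodeId);
      onNewNode(); // delivered synchronously here to make the ordering deterministic
      if (!reportFirst) {
        freeSpace.put(nodeId, reportedFree);
      }
    }
  }

  /** Registers three nodes and returns how many factor-three pipelines exist. */
  static int pipelinesAfterThreeNodes(boolean reportFirst) {
    Scm scm = new Scm();
    for (String id : new String[] {"dn1", "dn2", "dn3"}) {
      scm.register(id, 10 * CONTAINER_SIZE, reportFirst);
    }
    return scm.factorThreePipelines;
  }

  public static void main(String[] args) {
    System.out.println("event before report: " + pipelinesAfterThreeNodes(false));
    System.out.println("report before event: " + pipelinesAfterThreeNodes(true));
  }
}
```

      In this model, firing the event first leaves the third node looking empty when the pipeline creator runs, so only two nodes qualify and no factor-three pipeline is created, which matches the "Required 3. Found 2." warning in the log below.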

      Relevant log lines from the test failure:

      2021-06-09 21:04:44,087 [Listener at 0.0.0.0/34005] INFO  net.NetworkTopologyImpl (NetworkTopologyImpl.java:add(112)) - Added a new node: /default-rack/b06583c0-2c53-452b-83e4-398ff0104f72
      2021-06-09 21:04:44,087 [RatisPipelineUtilsThread - 0] WARN  pipeline.PipelinePlacementPolicy (PipelinePlacementPolicy.java:filterViableNodes(151)) - Pipeline creation failed due to no sufficient healthy datanodes. Required 3. Found 2.
      2021-06-09 21:04:44,088 [EventQueue-NewNodeForNewNodeHandler] INFO  pipeline.BackgroundPipelineCreator (BackgroundPipelineCreatorV2.java:notifyEventTriggered(282)) - trigger a one-shot run on RatisPipelineUtilsThread.
      2021-06-09 21:04:44,088 [RatisPipelineUtilsThread - 0] INFO  pipeline.RatisPipelineProvider (RatisPipelineProvider.java:lambda$create$0(170)) - Sending CreatePipelineCommand for pipeline:PipelineID=8bfba789-d337-4fed-9eb6-b1debd3d19e8 to datanode:b06583c0-2c53-452b-83e4-398ff0104f72
      2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO  pipeline.PipelineStateManager (PipelineStateManagerV2Impl.java:addPipeline(101)) - Created pipeline Pipeline[ Id: 8bfba789-d337-4fed-9eb6-b1debd3d19e8, Nodes: b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host: localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: RATIS/ONE, State:ALLOCATED, leaderId:, CreationTimestamp2021-06-09T21:04:44.088Z].
      2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO  ha.SCMHAInvocationHandler (SCMHAInvocationHandler.java:invokeRatis(113)) - Invoking method public abstract void org.apache.hadoop.hdds.scm.pipeline.StateManager.addPipeline(org.apache.hadoop.hdds.protocol.proto.HddsProtos$Pipeline) throws java.io.IOException on target org.apache.hadoop.hdds.scm.ha.MockSCMHAManager$MockRatisServer@5bf60155, cost 655.117us
      2021-06-09 21:04:44,091 [RatisPipelineUtilsThread - 0] WARN  pipeline.PipelinePlacementPolicy (PipelinePlacementPolicy.java:filterViableNodes(170)) - Pipeline creation failed due to no sufficient healthy datanodes with enough space for even a single container. Required 3. Found 2. Container size 5368709120.
      2021-06-09 21:04:44,092 [Listener at 0.0.0.0/34005] INFO  node.SCMNodeManager (SCMNodeManager.java:register(386)) - Registered Data node : b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host: localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
      2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO  node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 3 pipelines of type RATIS and factor ONE.
      2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO  node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
      2021-06-09 21:04:45,094 [Listener at 0.0.0.0/34005] INFO  node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
      2021-06-09 21:04:46,094 [Listener at 0.0.0.0/34005] INFO  node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
      2021-06-09 21:04:47,095 [Listener at 0.0.0.0/34005] INFO  node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
      2021-06-09 21:04:48,096 [Listener at 0.0.0.0/34005] INFO  node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
      2021-06-09 21:04:49,096 [Listener at 0.0.0.0/34005] INFO  node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
      2021-06-09 21:04:50,097 [Listener at 0.0.0.0/34005] INFO  node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
      2021-06-09 21:04:51,097 [Listener at 0.0.0.0/34005] INFO  node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
      2021-06-09 21:04:52,098 [Listener at 0.0.0.0/34005] INFO  node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
      2021-06-09 21:04:53,098 [Listener at 0.0.0.0/34005] INFO  node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
      2021-06-09 21:04:54,099 [Listener at 0.0.0.0/34005] INFO  node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
      

      Note that the new node is the third node registered, so we would expect a Ratis factor three pipeline to be created after this event. Factor one pipeline creation succeeds for this new node due to HDDS-5337, although this is not related to this test failure.

            People

              erose Ethan Rose
              erose Ethan Rose
