Apache Ozone / HDDS-4766

Recon resets the Operational State of datanodes to IN_SERVICE

Details

    Description

      When a datanode is decommissioned or put into maintenance, its new operational state is persisted to the datanode.yaml file. On a cluster with Recon enabled, the datanode then repeatedly receives conflicting commands from SCM and Recon, e.g.:

      datanode_3  | 2021-01-29 16:26:20,009 [EndpointStateMachine task thread for scm/172.24.0.6:9861 - 0 ] INFO endpoint.HeartbeatEndpointTask: Received SCM set operational state command. State: DECOMMISSIONED Expiry: 0 id 3645344
      datanode_3  | 2021-01-29 16:26:50,012 [EndpointStateMachine task thread for recon/172.24.0.3:9891 - 0 ] INFO commands.SetNodeOperationalStateCommand: Create a new command to set op state IN_SERVICE 0 id is 3675347
      

      This happens because Recon delegates the processing of DN heartbeats received by ReconNodeManager to an instance of SCMNodeManager running inside Recon. SCMNodeManager checks that the operational state reported by the datanode matches the state it holds in memory and, if they do not match, issues a command to the DN to update its state.

      In this case, the in-memory state inside Recon is still IN_SERVICE, so Recon always tries to set the DN state back to IN_SERVICE.
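
      The mismatch can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: OpState, HeartbeatChecker and their methods are hypothetical names, not the actual Ozone classes, and the model assumes the admin request reaches SCM but never reaches Recon.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified model; these are NOT the real Ozone types.
enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED, IN_MAINTENANCE }

class HeartbeatChecker {
    // In-memory operational state per datanode, keyed by a datanode id.
    private final Map<String, OpState> inMemoryState = new HashMap<>();

    // Assumption: a newly registered datanode starts as IN_SERVICE.
    void register(String datanodeId) {
        inMemoryState.put(datanodeId, OpState.IN_SERVICE);
    }

    // An admin request (for example a decommission) updates the in-memory state.
    void setState(String datanodeId, OpState state) {
        inMemoryState.put(datanodeId, state);
    }

    // Compares the reported state with the in-memory state. On a mismatch,
    // returns the state that would be sent back to the datanode in a command.
    Optional<OpState> processHeartbeat(String datanodeId, OpState reported) {
        OpState expected = inMemoryState.getOrDefault(datanodeId, OpState.IN_SERVICE);
        return expected == reported ? Optional.empty() : Optional.of(expected);
    }
}

class MismatchDemo {
    public static void main(String[] args) {
        HeartbeatChecker scm = new HeartbeatChecker();
        HeartbeatChecker recon = new HeartbeatChecker();
        scm.register("datanode_3");
        recon.register("datanode_3");

        // Only the SCM-side model sees the decommission request.
        scm.setState("datanode_3", OpState.DECOMMISSIONED);

        // The datanode persists DECOMMISSIONED and reports it to both endpoints.
        System.out.println("scm command:   "
            + scm.processHeartbeat("datanode_3", OpState.DECOMMISSIONED));
        System.out.println("recon command: "
            + recon.processHeartbeat("datanode_3", OpState.DECOMMISSIONED));
    }
}
```

      In this model the SCM-side check produces no command, while the Recon-side check asks for IN_SERVICE on every heartbeat, which matches the alternating log lines above.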

      The fix is probably to update Recon's in-memory state before delegating the heartbeat to SCMNodeManager.
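
      One way the proposed ordering could look is sketched below. The class and method names (NodeOpState, DelegateNodeManager, ReconLikeNodeManager) are hypothetical stand-ins rather than the real Ozone API, and the sketch assumes Recon should simply trust whatever operational state the datanode reports.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified model; these are NOT the real Ozone types.
enum NodeOpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED, IN_MAINTENANCE }

// Stand-in for the delegated check: compare reported and in-memory state.
class DelegateNodeManager {
    protected final Map<String, NodeOpState> inMemoryState = new HashMap<>();

    // Returns the state to command back to the datanode, or empty if they agree.
    Optional<NodeOpState> processHeartbeat(String datanodeId, NodeOpState reported) {
        NodeOpState expected =
            inMemoryState.getOrDefault(datanodeId, NodeOpState.IN_SERVICE);
        return expected == reported ? Optional.empty() : Optional.of(expected);
    }
}

// Stand-in for the Recon side: adopt the reported state first, then delegate.
class ReconLikeNodeManager extends DelegateNodeManager {
    @Override
    Optional<NodeOpState> processHeartbeat(String datanodeId, NodeOpState reported) {
        // Assumption: the datanode's reported state is authoritative for Recon.
        inMemoryState.put(datanodeId, reported);
        // The delegated comparison now sees matching states and issues no command.
        return super.processHeartbeat(datanodeId, reported);
    }
}
```

      With this ordering the delegated comparison never finds a mismatch on the Recon side, so no IN_SERVICE command is generated, while the unmodified delegate keeps its original behavior.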

            People

              sodonnell Stephen O'Donnell
              nilotpalnandi Nilotpal Nandi
              Votes: 0
              Watchers: 3
