Apache Ozone / HDDS-9055

Datanode decommission Failed, Follower never received the command

Details

    Description

      Issue:
      In one of the Cloudera system tests, two Datanodes are scheduled for decommission after the data write and data pipeline close.
      The LEADER received the scheduled decommission command, as the test expects, but the FOLLOWER never received it.

      Summary logs:

      Follower

      2023-07-20 19:58:04,931 : persistedOpState: DECOMMISSIONING, the value stored in SCM (IN_SERVICE, 0)
      2023-07-20 19:58:10,016 : persistedOpState: IN_SERVICE, the value stored in SCM (DECOMMISSIONING, 0)

      Leader (timed out)

      2023-07-20 19:38:31,689 : persistedOpState: IN_SERVICE, the value stored in SCM (DECOMMISSIONING, 0)
      ...... multiple retries .......
      2023-07-20 19:55:54,323 : persistedOpState: IN_SERVICE, the value stored in SCM (DECOMMISSIONING, 0)
      2023-07-20 19:56:24,344 : persistedOpState: IN_SERVICE, the value stored in SCM (DECOMMISSIONING, 0)
      
      2023-07-20 19:58:04,931 : persistedOpState: DECOMMISSIONING, the value stored in SCM (IN_SERVICE, 0)
      

      Detailed logs:

      FOLLOWER
      
      2023-07-20 19:58:04,931 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Update the operationalState saved in follower SCM for 33c95701-aaa5-4b08-a56b-70ac5d237187{ip: 172.27.12.66, host: quasar-zqlpfe-5.quasar-zqlpfe.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default-rack, certSerialId: 70976812254805668, persistedOpState: DECOMMISSIONING, persistedOpStateExpiryEpochSec: 0} as the reported value does not match the value stored in SCM (IN_SERVICE, 0)
      
      2023-07-20 19:58:10,016 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Update the operationalState saved in follower SCM for 33c95701-aaa5-4b08-a56b-70ac5d237187{ip: 172.27.12.66, host: quasar-zqlpfe-5.quasar-zqlpfe.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default-rack, certSerialId: 70976812254805668, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0} as the reported value does not match the value stored in SCM (DECOMMISSIONING, 0)
      
      
      LEADER
      
      2023-07-20 19:56:24,344 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Scheduling a command to update the operationalState persisted on 33c95701-aaa5-4b08-a56b-70ac5d237187{ip: 172.27.12.66, host: quasar-zqlpfe-5.quasar-zqlpfe.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default-rack, certSerialId: 70976812254805668, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0} as the reported value does not match the value stored in SCM (DECOMMISSIONING, 0)
      
      2023-07-20 19:58:04,931 INFO org.apache.hadoop.hdds.scm.node.SCMNodeManager: Scheduling a command to update the operationalState persisted on 33c95701-aaa5-4b08-a56b-70ac5d237187{ip: 172.27.12.66, host: quasar-zqlpfe-5.quasar-zqlpfe.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default-rack, certSerialId: 70976812254805668, persistedOpState: DECOMMISSIONING, persistedOpStateExpiryEpochSec: 0} as the reported value does not match the value stored in SCM (IN_SERVICE, 0)
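
      The summary lines above can be reproduced from the detailed SCMNodeManager messages with a small script. The following is a minimal sketch, not part of Ozone: the function name summarize and the regular expression are hypothetical, and the pattern is inferred only from the log lines quoted in this report, so it may need adjusting for other message shapes.

```python
import re

# Hypothetical pattern, inferred only from the SCMNodeManager lines quoted in
# this report (not from a documented log format).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*?"
    r"persistedOpState: (?P<reported>\w+),.*?"
    r"value stored in SCM \((?P<stored>\w+), (?P<expiry>\d+)\)"
)


def summarize(lines):
    """Condense matching log lines to 'timestamp : reported vs stored' form.

    Lines that do not match the pattern are skipped.
    """
    summary = []
    for line in lines:
        match = LINE_RE.search(line.strip())
        if match is None:
            continue
        summary.append(
            f"{match['ts']} : persistedOpState: {match['reported']}, "
            f"the value stored in SCM ({match['stored']}, {match['expiry']})"
        )
    return summary


if __name__ == "__main__":
    import sys

    # Usage (assumed file name): python summarize.py < scm.log
    for entry in summarize(sys.stdin):
        print(entry)
```

      Feeding the leader and follower SCM logs through the script separately gives the two timelines side by side, which makes the reported state versus the state stored in SCM easier to compare across the retries.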
      

      See the attached SCM logs for more details.

            People

              sumitagrawal Sumit Agrawal
              ssulav Soumitra Sulav