[HDDS-9134] GRPC based replication can get stuck forever if the receiver is not available

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4.0
Component/s: Ozone Datanode
Labels: pull-request-available
Target Version/s: 1.4.0

Description

Decommissioning of a DN does not complete even after 40 minutes, because 7 containers are still under-replicated.

We are seeing this issue across multiple runs for some of the decommissioning test cases.

SCM logs:

2023-08-04 00:51:13,994 INFO [IPC Server handler 3 on 9860]-org.apache.hadoop.hdds.scm.node.NodeDecommissionManager: Starting Decommission for node b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136)
2023-08-04 00:51:13,994 INFO [EventQueue-HealthyReadonlyToHealthyNodeForReadOnlyHealthyToHealthyNodeHandler]-org.apache.hadoop.hdds.scm.node.ReadOnlyHealthyToHealthyNodeHandler: Datanode b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136) moved to HEALTHY state.
2023-08-04 00:51:13,994 INFO [EventQueue-HealthyReadonlyToHealthyNodeForReadOnlyHealthyToHealthyNodeHandler]-org.apache.hadoop.hdds.scm.pipeline.BackgroundPipelineCreator: trigger a one-shot run on RatisPipelineUtilsThread.
2023-08-04 00:51:13,996 WARN [RatisPipelineUtilsThread - 0]-org.apache.hadoop.hdds.scm.pipeline.PipelinePlacementPolicy: Pipeline creation failed due to no sufficient healthy datanodes. Required 3. Found 2. Excluded 6.
2023-08-04 00:51:18,083 INFO [IPC Server handler 51 on 9861]-org.apache.hadoop.hdds.scm.node.SCMNodeManager: Scheduling a command to update the operationalState persisted on b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136) as the reported value (IN_SERVICE, 0) does not match the value stored in SCM (DECOMMISSIONING, 0)


2023-08-04 00:53:00,505 INFO [DatanodeAdminManager-0]-org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136) has 136 sufficientlyReplicated, 128 underReplicated and 4 unhealthy containers
2023-08-04 00:53:00,505 INFO [DatanodeAdminManager-0]-org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: There are 1 nodes tracked for decommission and maintenance.  0 pending nodes.


2023-08-04 01:34:00,502 INFO [DatanodeAdminManager-0]-org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: Under Replicated Container #26169 Container State: CLOSED, Replicas: (Count: 5, Healthy: 4, Decommission: 1, PendingAdd: 1), ReplicationConfig: EC{rs-3-2-1024k}, RemainingMaintenanceRedundancy: 1; Replicas{ContainerReplica{containerID=#26169, state=CLOSED, datanodeDetails=1a2bc52e-8694-4e07-ba14-78c5f91e1e32(quasar-xvihtz-3.quasar-xvihtz.root.hwx.site/172.27.25.10), placeOfBirth=1a2bc52e-8694-4e07-ba14-78c5f91e1e32, sequenceId=0, keyCount=4, bytesUsed=252,replicaIndex=4, isEmpty=false},ContainerReplica{containerID=#26169, state=CLOSED, datanodeDetails=7f516e1e-980b-421d-9eb7-43889e33346b(quasar-xvihtz-1.quasar-xvihtz.root.hwx.site/172.27.114.66), placeOfBirth=7f516e1e-980b-421d-9eb7-43889e33346b, sequenceId=0, keyCount=4, bytesUsed=252,replicaIndex=1, isEmpty=false},ContainerReplica{containerID=#26169, state=CLOSED, datanodeDetails=b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136), placeOfBirth=ee6926fa-1969-4e9a-bd43-e328c3b21a2f, sequenceId=0, keyCount=4, bytesUsed=252,replicaIndex=5, isEmpty=false},ContainerReplica{containerID=#26169, state=CLOSED, datanodeDetails=240fe27d-3c00-4413-a775-6b8894980d8c(quasar-xvihtz-4.quasar-xvihtz.root.hwx.site/172.27.186.70), placeOfBirth=240fe27d-3c00-4413-a775-6b8894980d8c, sequenceId=0, keyCount=4, bytesUsed=0,replicaIndex=2, isEmpty=false},ContainerReplica{containerID=#26169, state=CLOSED, datanodeDetails=53b8df38-70c4-459f-a3c9-83fce7d947e5(quasar-xvihtz-5.quasar-xvihtz.root.hwx.site/172.27.103.128), placeOfBirth=53b8df38-70c4-459f-a3c9-83fce7d947e5, sequenceId=0, keyCount=4, bytesUsed=0,replicaIndex=3, isEmpty=false}}
2023-08-04 01:34:00,502 INFO [DatanodeAdminManager-0]-org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136) has 256 sufficientlyReplicated, 7 underReplicated and 0 unhealthy containers
2023-08-04 01:34:00,502 INFO [DatanodeAdminManager-0]-org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: There are 1 nodes tracked for decommission and maintenance.  0 pending nodes.


2023-08-04 01:34:25,934 INFO [IPC Server handler 44 on 9860]-org.apache.hadoop.hdds.scm.node.NodeDecommissionManager: Queued node b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136) for recommission
2023-08-04 01:34:30,501 INFO [DatanodeAdminManager-0]-org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl: Recommissioned node b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136)
2023-08-04 01:34:30,501 INFO [EventQueue-HealthyReadonlyToHealthyNodeForReadOnlyHealthyToHealthyNodeHandler]-org.apache.hadoop.hdds.scm.node.ReadOnlyHealthyToHealthyNodeHandler: Datanode b4d59f24-753c-40a0-935a-94495b700694(quasar-xvihtz-6.quasar-xvihtz.root.hwx.site/172.27.173.136) moved to HEALTHY state.

The decommissioning was submitted at 00:51:13. At 00:53:00 the DatanodeAdminMonitorImpl reported 128 under-replicated and 4 unhealthy containers. At 01:34:00, more than 40 minutes after submission, 7 under-replicated containers were still left.
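To check the "more than 40 mins" figure against the log timestamps, the elapsed time can be computed directly. This is only a small sketch; the class and method names are made up for illustration, and the timestamp pattern is an assumption based on the SCM log lines quoted above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Ozone: computes the time between two SCM log timestamps.
public class DecommissionElapsed {
    // Assumed to match the timestamp prefix of the SCM log lines, e.g. "2023-08-04 00:51:13,994".
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    static Duration between(String start, String end) {
        return Duration.between(
            LocalDateTime.parse(start, LOG_TS),
            LocalDateTime.parse(end, LOG_TS));
    }

    public static void main(String[] args) {
        // "Starting Decommission" line vs. the last DatanodeAdminMonitorImpl report.
        Duration d = between("2023-08-04 00:51:13,994", "2023-08-04 01:34:00,502");
        // prints "42 min 46 s"
        System.out.println(d.toMinutes() + " min " + (d.getSeconds() % 60) + " s");
    }
}
```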

The test case then aborted the decommissioning command and recommissioned the DN.
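The title and the linked HDDS-9081 point at the sender waiting on back pressure from a receiver that never becomes available. This ticket does not show the actual fix, so the following is only a generic sketch of the idea of bounding such a wait with a timeout rather than blocking indefinitely. `BoundedReadyWait`, `markReady` and `awaitReady` are hypothetical names and are not taken from the Ozone code base or the gRPC API.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only: a readiness gate whose wait cannot hang forever.
public class BoundedReadyWait {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition readyChanged = lock.newCondition();
    private boolean ready;

    // Would be invoked by a transport callback once the stream can accept more data.
    public void markReady() {
        lock.lock();
        try {
            ready = true;
            readyChanged.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Waits until the receiver is ready, but fails after the timeout
    // so the caller can abort the replication task and retry elsewhere.
    public void awaitReady(long timeout, TimeUnit unit)
            throws InterruptedException, TimeoutException {
        long nanosLeft = unit.toNanos(timeout);
        lock.lock();
        try {
            while (!ready) {
                if (nanosLeft <= 0) {
                    throw new TimeoutException("receiver did not become ready in time");
                }
                // awaitNanos returns an estimate of the remaining wait time.
                nanosLeft = readyChanged.awaitNanos(nanosLeft);
            }
        } finally {
            lock.unlock();
        }
    }
}
```

With an unbounded wait, an unavailable receiver would leave the sending thread parked and the container under-replicated; with a bound, the failure surfaces as an exception that the caller can handle.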

Issue Links

is caused by: HDDS-9081 Handling GRPC/Netty back pressure when streaming containers for replication. (Resolved)
links to: GitHub Pull Request #5161
