Details
-
Bug
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
Description
Issue:
There was an observed delay in the SCM exiting safe mode, taking slightly over a minute in one instance. This issue is related to the HealthyPipelineSafeModeRule, which depends on DataNodes reporting the health of pipelines to the SCM. Delays in pipeline reporting can impact the time taken to meet the criteria for exiting safe mode, thereby delaying client operations like writes.
Cause:
The delay was caused by the following factors:
- DataNode Registration: DataNodes register with the SCM using heartbeat intervals, which take around 30 seconds. If the leader SCM is restarted and regains leadership, all DataNodes and pipelines need to re-register, leading to delays.
- Pipeline Health Reporting: Pipelines are marked as "healthy" only when all associated DataNodes are active and reporting. This process may take additional time due to the stabilisation of pipelines and network conditions.
Logs:
From the logs:
2024-10-24 00:45:49,630 INFO [main]-org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule: containers with one replica threshold count 0 2024-10-24 00:45:49,636 INFO [main]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Total pipeline count is 4, healthy pipeline threshold count is 1 2024-10-24 00:45:49,639 INFO [main]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Total pipeline count is 4, pipeline's with at least one datanode reported threshold count is 4 2024-10-24 00:45:55,452 INFO [main]-org.apache.hadoop.hdds.scm.ha.SCMContext: Update SafeModeStatus from SafeModeStatus{safeModeStatus=true, preCheckPassed=false} to SafeModeStatus{safeModeStatus=true, preCheckPassed=false}. 2024-10-24 00:46:01,908 INFO [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 4, healthy pipeline threshold count is 1 2024-10-24 00:46:01,911 INFO [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.ContainerSafeModeRule: Refreshed one replica container threshold 0, currentThreshold 0 2024-10-24 00:46:01,912 INFO [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Refreshed Total pipeline count is 4, pipeline's with at least one datanode reported threshold count is 4 2024-10-24 00:47:00,770 INFO [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.HealthyPipelineSafeModeRule: Refreshed total pipeline count is 4, healthy pipeline threshold count is 1 2024-10-24 00:47:00,770 INFO [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: ContainerSafeModeRule rule is successfully validated 2024-10-24 00:47:00,770 INFO [bc18d0e0-948d-4098-b32b-a7a57c95b9cd@group-B673F31F7F94-StateMachineUpdater]-org.apache.hadoop.hdds.scm.safemode.OneReplicaPipelineSafeModeRule: Refreshed Total pipeline count is 4, pipeline's with at least one datanode reported threshold count is 4 org.apache.hadoop.hdds.scm.exceptions.SCMException: SafeModePrecheck failed for deleteBlocks 2024-10-24 00:47:09,206 INFO [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. 1 DataNodes registered, 1 required. 2024-10-24 00:47:09,303 INFO [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: DataNodeSafeModeRule rule is successfully validated 2024-10-24 00:47:09,303 INFO [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: All SCM safe mode pre check rules have passed 2024-10-24 00:47:09,303 INFO [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.ha.SCMContext: Update SafeModeStatus from SafeModeStatus{safeModeStatus=true, preCheckPassed=false} to SafeModeStatus{safeModeStatus=true, preCheckPassed=true}. 2024-10-24 00:47:09,303 INFO [node3-EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule]-org.apache.hadoop.hdds.scm.pipeline.BackgroundPipelineCreator: trigger a one-shot run on node3-RatisPipelineUtilsThread. 2024-10-24 00:47:09,207 INFO [node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Pipelines with at least one datanode reported count is 2, required at least one datanode reported per pipeline count is 4 2024-10-24 00:47:09,304 INFO [node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Pipelines with at least one datanode reported count is 4, required at least one datanode reported per pipeline count is 4 2024-10-24 00:47:09,304 INFO [node3-EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: AtleastOneDatanodeReportedRule rule is successfully validated 2024-10-24 00:47:09,620 INFO [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1 2024-10-24 00:47:09,620 INFO [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1 2024-10-24 00:47:09,620 INFO [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1 2024-10-24 00:47:09,620 INFO [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 0, required healthy pipeline reported count is 1 2024-10-24 00:47:09,935 INFO [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count is 1, required healthy pipeline reported count is 1 2024-10-24 00:47:09,935 INFO [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: HealthyPipelineSafeModeRule rule is successfully validated 2024-10-24 00:47:09,935 INFO [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: ScmSafeModeManager, all rules are successfully validated 2024-10-24 00:47:09,935 INFO [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.safemode.SCMSafeModeManager: SCM exiting safe mode. 2024-10-24 00:47:09,936 INFO [node3-EventQueue-OpenPipelineForHealthyPipelineSafeModeRule]-org.apache.hadoop.hdds.scm.ha.SCMContext: Update SafeModeStatus from SafeModeStatus{safeModeStatus=true, preCheckPassed=true} to SafeModeStatus{safeModeStatus=false, preCheckPassed=true}.
The HealthyPipelineSafeModeRule took a significant time to validate the required healthy pipeline threshold, contributing to the delay.
Impact:
Due to this delay, Ozone client write operations were impacted, as they rely on the SCM being fully operational to allocate blocks or create keys.
Resolution:
To prevent write failures during such delays, we propose increasing the following configurations in the Ozone client:
- BLOCK_ALLOCATION_RETRY_WAIT_TIME_MS
- BLOCK_ALLOCATION_RETRY_COUNT
This will allow the client to wait longer (slightly over a minute) for the SCM to exit safe mode, ensuring that writes are not prematurely failed during such scenarios.
Attachments
Issue Links
- links to