[HDDS-9823] Pipeline failure should trigger heartbeat immediately - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Implemented
Affects Version/s: None
Fix Version/s: 1.5.0
Component/s: Ozone Datanode, SCM
Labels:
- pull-request-available

Description

XceiverServerRatis#handlePipelineFailure is called in the following ContainerStateMachine (CSM) failure scenarios:

- XceiverServerRatis#handleNodeSlowness
  - Called from StateMachine#notifyFollowerSlowness
  - Timeout is configured by hdds.ratis.rpc.slowness.timeout (default value 300s)
    - Note: Ratis default value is 60s
- XceiverServerRatis#handleNoLeader
  - Called from StateMachine#notifyExtendedNoLeader
  - Timeout is configured by hdds.ratis.notification.no-leader.timeout (default value 300s)
    - Note: Ratis default value is 60s
- XceiverServerRatis#handleInstallSnapshotFromLeader
  - Called from StateMachine#notifyInstallSnapshotFromLeader

Currently, XceiverServerRatis#handlePipelineFailure does not trigger a heartbeat to SCM immediately. Instead, the pipeline close action is sent with the next scheduled heartbeat (default 60s). During this window, SCM might still allocate blocks on these "failed" pipelines, which can affect clients writing to those blocks.
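To make the size of that window concrete, here is a rough back-of-the-envelope sketch using the defaults quoted in this issue (300s notification timeout, 60s heartbeat). The class and method names are hypothetical and are not part of Ozone. It assumes the worst case, where the failure is detected just after a heartbeat was sent, and it ignores network and SCM processing time.

```java
import java.time.Duration;

public class PipelineCloseDelaySketch {

  /**
   * Worst-case time from the start of the fault until the close action
   * leaves the datanode: the notification timeout plus however long the
   * datanode waits before its next heartbeat.
   */
  static Duration worstCaseDelay(Duration notificationTimeout, Duration heartbeatWait) {
    return notificationTimeout.plus(heartbeatWait);
  }

  public static void main(String[] args) {
    Duration timeout = Duration.ofSeconds(300);   // default quoted in this issue
    Duration heartbeat = Duration.ofSeconds(60);  // default quoted in this issue

    // Waiting for the next scheduled heartbeat: prints 360
    System.out.println(worstCaseDelay(timeout, heartbeat).getSeconds());

    // Triggering a heartbeat immediately: prints 300
    System.out.println(worstCaseDelay(timeout, Duration.ZERO).getSeconds());
  }
}
```

Under these assumptions, an immediate heartbeat removes up to one full heartbeat interval during which SCM could still hand out blocks on the failed pipeline.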

To minimize the impact on the client and on the datanodes in the failed pipeline, I suggest that the datanode trigger a heartbeat immediately, so that the pipeline close action is sent right away, for every close action raised due to pipeline failure.
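The difference between the current and the suggested behavior can be illustrated with a small, self-contained model. This is only a sketch: HeartbeatSender and both handler methods are hypothetical stand-ins invented for illustration, not the actual Ozone classes or the code in the linked pull request, and the model is single-threaded, so a "heartbeat" is just a method call.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class PipelineFailureSketch {

  /** Hypothetical stand-in for the datanode's heartbeat path to SCM. */
  static class HeartbeatSender {
    // Close actions queued for the next heartbeat; a set de-duplicates them.
    private final Set<String> pendingCloseActions = new LinkedHashSet<>();
    // Each element is the payload of one heartbeat that was "sent".
    private final List<List<String>> sentHeartbeats = new ArrayList<>();

    void addCloseActionIfAbsent(String pipelineId) {
      pendingCloseActions.add(pipelineId);
    }

    /** Sends one heartbeat carrying every queued close action. */
    void triggerHeartbeat() {
      sentHeartbeats.add(new ArrayList<>(pendingCloseActions));
      pendingCloseActions.clear();
    }

    int pendingCount() {
      return pendingCloseActions.size();
    }

    List<List<String>> sentHeartbeats() {
      return sentHeartbeats;
    }
  }

  /** Current behavior: queue the action and wait for the periodic heartbeat. */
  static void handlePipelineFailureDeferred(HeartbeatSender hb, String pipelineId) {
    hb.addCloseActionIfAbsent(pipelineId);
  }

  /** Suggested behavior: queue the action, then heartbeat right away. */
  static void handlePipelineFailureImmediate(HeartbeatSender hb, String pipelineId) {
    hb.addCloseActionIfAbsent(pipelineId);
    hb.triggerHeartbeat();
  }

  public static void main(String[] args) {
    HeartbeatSender deferred = new HeartbeatSender();
    handlePipelineFailureDeferred(deferred, "pipeline-1");
    // prints "deferred: sent=0 pending=1"
    System.out.println("deferred: sent=" + deferred.sentHeartbeats().size()
        + " pending=" + deferred.pendingCount());

    HeartbeatSender immediate = new HeartbeatSender();
    handlePipelineFailureImmediate(immediate, "pipeline-1");
    // prints "immediate: sent=1 pending=0"
    System.out.println("immediate: sent=" + immediate.sentHeartbeats().size()
        + " pending=" + immediate.pendingCount());
  }
}
```

In the deferred variant the close action sits in the queue until something else sends a heartbeat; in the immediate variant it goes out in the same call that detected the failure.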

Issue Links

relates to

HDDS-1603 Handle Ratis Append Failure in Container State Machine

Resolved

HDDS-9826 Fix exception handling if one Datanode is not available (Ratis)

Resolved

links to

GitHub Pull Request #5725

People

Assignee: Ivan Andika

Reporter: Ivan Andika

Dates

Created: 04/Dec/23 08:27

Updated: 29/Jan/24 07:12

Resolved: 29/Jan/24 07:12