[HDDS-9323] Apply expiry of excluded datanodes to writing Ratis keys - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4.0
Component/s: Ozone Client
Labels:
- pull-request-available

Description

Better datanode exclude list handling for long-lived clients

Currently, a long-lived client can add most or all nodes of a small cluster to its exclude list, after which further writes using that client instance fail. This can be improved in two ways:

- Add a timeout that removes nodes from the exclude list after a set period so that they can be retried. For EC, this already exists and defaults to 10 minutes. Ratis does not currently have it, but it should be added. (this task)
- Allow the write to fall back to nodes in the exclude list if those are the only nodes available. This could be implemented on the server side, or as a client retry based on the server's initial response. (extracted to ~~HDDS-9551~~)
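
The timeout described in the first item could look roughly like the sketch below. This is only an illustration of the idea, not the Ozone client's actual exclude list: the class name `ExpiringExcludeList`, its methods, the use of plain string IDs for datanodes, and the convention that a non-positive expiry disables the timeout are all assumptions made for the example. The clock is injected so the expiry can be exercised without sleeping.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical sketch of an exclude list whose entries expire.
// Not the real Ozone class; names and semantics are assumptions.
class ExpiringExcludeList {
  private final long expiryMillis;   // <= 0 means entries never expire (assumed convention)
  private final LongSupplier clock;  // returns the current time in milliseconds
  private final Map<String, Long> excludedAt = new ConcurrentHashMap<>();

  ExpiringExcludeList(long expiryMillis, LongSupplier clock) {
    this.expiryMillis = expiryMillis;
    this.clock = clock;
  }

  // Record the time at which a datanode was excluded.
  void exclude(String datanodeId) {
    excludedAt.put(datanodeId, clock.getAsLong());
  }

  // A node counts as excluded only while its entry is younger than the expiry.
  boolean isExcluded(String datanodeId) {
    Long since = excludedAt.get(datanodeId);
    if (since == null) {
      return false;
    }
    if (expiryMillis > 0 && clock.getAsLong() - since >= expiryMillis) {
      // Entry has expired: drop it so the node can be retried.
      excludedAt.remove(datanodeId, since);
      return false;
    }
    return true;
  }

  // Snapshot of the nodes that are still excluded right now.
  Set<String> getExcluded() {
    Set<String> live = new HashSet<>();
    for (String id : excludedAt.keySet()) {
      if (isExcluded(id)) {
        live.add(id);
      }
    }
    return live;
  }
}
```

With a 10 minute expiry (the EC default mentioned above), a long-lived client such as the S3 gateway would construct it as `new ExpiringExcludeList(10 * 60 * 1000L, System::currentTimeMillis)`, and a node excluded after one failed write would become eligible again 10 minutes later instead of staying excluded for the lifetime of the client.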

These issues are especially relevant for the S3 gateway, which keeps a persistent Ozone client connected to the cluster for as long as the gateway is running.

Issue Links

duplicates

HDDS-6927 XceiverClientRatis: 3 way commit failed on pipeline Pipeline

Resolved

is a parent of

HDDS-9551 Allow the client write to fall back to nodes in the exclude list if that is all that is available

Resolved

links to

GitHub Pull Request #5530

People

Assignee: Dave Teng

Reporter: Ethan Rose

Votes: 0

Watchers: 3

Dates

Created: 19/Sep/23 21:10

Updated: 21/Nov/23 16:24

Resolved: 21/Nov/23 16:24