[HDDS-10879] Statemachine transaction resiliency for OM - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

Concerns with the current state machine:

OM crashes and cannot provide service when an experimental feature that is not yet stable in the deployed version is used
The OM state machine crashes because of a bug in a specific feature

Impact: OM crashes and the service becomes unavailable

Solution Points:

1. Support skipping a transaction through configuration

Recovering the system often requires skipping a particular transaction. Otherwise the system becomes inoperable.

When a code bug in an operation is the cause, users need to be able to decide to skip the transaction and recover the system.
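
As an illustration of the skip-by-configuration idea above, here is a minimal sketch, assuming the operator supplies a comma-separated list of transaction indexes to skip. The class name, the configuration key and the list format are hypothetical and are not existing Ozone properties.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper: tells the apply path whether a transaction is configured to be skipped. */
final class TransactionSkipList {
  // Hypothetical configuration key, used here only for illustration.
  static final String SKIP_KEY = "ozone.om.statemachine.skip.transaction.indexes";

  private final Set<Long> skipped;

  /** @param commaSeparatedIndexes value read from configuration, for example "105, 230" */
  TransactionSkipList(String commaSeparatedIndexes) {
    if (commaSeparatedIndexes == null || commaSeparatedIndexes.trim().isEmpty()) {
      this.skipped = Collections.emptySet();
    } else {
      this.skipped = Arrays.stream(commaSeparatedIndexes.split(","))
          .map(String::trim)
          .filter(s -> !s.isEmpty())
          .map(Long::parseLong)
          .collect(Collectors.toSet());
    }
  }

  /** Returns true when the operator asked for this transaction index to be skipped. */
  boolean shouldSkip(long transactionIndex) {
    return skipped.contains(transactionIndex);
  }
}
```

The state machine would consult `shouldSkip` before applying a transaction and, when it returns true, record the transaction as skipped instead of executing it.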

2. Fail an operation gracefully (without terminating the OM) for specific transactions

Operations can be divided into those that must crash the OM on failure and those that can simply fail.

Critical Operation

Create, commit and other critical operations whose failure can leave the system inconsistent.

Non-critical Operation

Internal cleanup, experimental features and other operations that are repetitive in nature, have no major impact on the system, and cause neither data loss nor further failures.

The category of each operation should be set in the configuration file for easy control.
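
The critical/non-critical split could be sketched as follows. This is only an illustration: the operation names, the `Action` values and the comma-separated configuration format are assumptions, and any operation not listed as non-critical is treated as critical by default.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical policy: maps a failed operation to "terminate the OM" or "fail the request only". */
final class OperationPolicy {
  enum Action { TERMINATE, FAIL_REQUEST }

  // Illustrative operation names; the real list would come from the OM request types.
  enum Op { CREATE_KEY, COMMIT_KEY, PURGE_KEYS, SNAPSHOT_PURGE }

  private final Set<Op> nonCritical;

  /** @param configured comma-separated non-critical operations, for example "purge_keys, snapshot_purge" */
  OperationPolicy(String configured) {
    nonCritical = EnumSet.noneOf(Op.class);
    if (configured != null) {
      for (String name : configured.split(",")) {
        String trimmed = name.trim();
        if (!trimmed.isEmpty()) {
          nonCritical.add(Op.valueOf(trimmed.toUpperCase(Locale.ROOT)));
        }
      }
    }
  }

  /** Anything not explicitly configured as non-critical is treated as critical. */
  Action onFailure(Op op) {
    return nonCritical.contains(op) ? Action.FAIL_REQUEST : Action.TERMINATE;
  }
}
```

Defaulting to `TERMINATE` keeps the current behaviour for every operation the operator has not explicitly relaxed.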

3. Failing operation (operation timeout)

An operation that takes longer than a threshold, for example 10 minutes, should be terminated and marked as failed. Such an operation is likely stuck, or the system cannot complete it because of a lack of memory or CPU.

These operations should be failed using an interrupt. For a critical operation this crashes the system; for a non-critical operation only the operation fails.

Metrics for the time taken by these operations are already captured.

The threshold needs to be configurable.
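
A minimal sketch of the timeout-and-interrupt behaviour, using only `java.util.concurrent`. The class name and the single worker thread are assumptions made for illustration; the threshold would come from the proposed configuration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical runner: fails an operation that exceeds a configured time threshold. */
final class TimedOperationRunner implements AutoCloseable {
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private final long timeoutMillis;

  TimedOperationRunner(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /** Runs the operation; throws TimeoutException if it does not finish within the threshold. */
  <T> T run(Callable<T> operation) throws Exception {
    Future<T> future = executor.submit(operation);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // cancel(true) interrupts the worker thread running the operation.
      future.cancel(true);
      throw e;
    } catch (ExecutionException e) {
      // Surface the operation's own exception where possible.
      Throwable cause = e.getCause();
      throw cause instanceof Exception ? (Exception) cause : e;
    }
  }

  @Override
  public void close() {
    executor.shutdownNow();
  }
}
```

Interruption in Java is cooperative: an operation that never blocks and never checks its interrupt flag keeps running after `cancel(true)`, so the operations themselves would need to respond to interrupts for this approach to free the thread.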

Idempotency: return "server busy" to the user, and the server should check for duplicate requests.

To discuss: the chain can get corrupted in snapshot

4. Logging the failed operation

An operation that is terminated abruptly should be logged with its operation type and transaction ID, so that it is clear which transaction failed. (Currently, this is logged only for normal failures.)
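
One way to picture this logging is a wrapper around the apply step that records the operation and transaction index before rethrowing. This is a sketch only: the class and method names are hypothetical, and `java.util.logging` is used to keep the example self-contained rather than to reflect the logging framework used by OM.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical wrapper: logs operation and transaction index when an apply step fails abruptly. */
final class FailedTransactionLogger {
  private static final Logger LOG =
      Logger.getLogger(FailedTransactionLogger.class.getName());

  /** Builds the log line; kept separate so the format is easy to check. */
  static String describe(String operation, long transactionIndex, Throwable cause) {
    return String.format("Transaction failed abruptly: operation=%s, index=%d, cause=%s",
        operation, transactionIndex, cause);
  }

  /** Runs the apply step and logs any unchecked failure before rethrowing it unchanged. */
  static void applyLogged(String operation, long transactionIndex, Runnable body) {
    try {
      body.run();
    } catch (RuntimeException | Error e) {
      LOG.log(Level.SEVERE, describe(operation, transactionIndex, e), e);
      throw e;
    }
  }
}
```

Because the exception is rethrown, the existing termination or failure handling still runs; the wrapper only guarantees that the log line exists first.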

5. Alternative approach to crashing

Crashes mostly happen while applying a Ratis transaction (a write operation). Instead of crashing, the OM can disable write operations and serve only read operations.

This requires a way to elect a leader (or otherwise identify a node) that provides the read service.

- Does this node need to step down as leader? If all nodes step down, it needs to be determined which node will serve read operations.

Add an option to boot the system in read-only mode.
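
The read-only fallback can be pictured as a gate checked before every write. The sketch below is an assumption about shape only; the class name and the exception type are hypothetical, and leader election is deliberately left out.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical gate: rejects writes once the node has switched to read-only mode. */
final class WriteGate {
  private final AtomicBoolean readOnly;

  /** @param startReadOnly true to boot directly in read-only mode */
  WriteGate(boolean startReadOnly) {
    this.readOnly = new AtomicBoolean(startReadOnly);
  }

  /** Called instead of terminating when a write transaction fails. */
  void enterReadOnly() {
    readOnly.set(true);
  }

  boolean isReadOnly() {
    return readOnly.get();
  }

  /** Throws if writes are disabled; read paths never call this. */
  void checkWriteAllowed(String operation) {
    if (readOnly.get()) {
      throw new IllegalStateException(
          "Write operation " + operation + " rejected: node is in read-only mode");
    }
  }
}
```

Read requests bypass the gate entirely, so they continue to be served after `enterReadOnly()` is called.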

Issue Links

relates to

HDDS-10822 Provide a repair tool to omit a raft log

Open

HDDS-10295 Provide an "ozone repair" subcommand to update the snapshot info in transactionInfoTable

Resolved

Sub-Tasks

1.	log transaction Id for every operation including exception causing termination	Resolved	Sumit Agrawal
2.	Add Object ID and Update ID to OM audit log messages	Resolved	Sumit Agrawal
3.	OM state machine move to readonly mode on failure	Resolved	Sumit Agrawal
4.	OM missing audit log for upgrade prepare, cancel and finalize	Resolved	Sumit Agrawal
5.	state machine fail configuration for critical/non-critical operation	Resolved	Sumit Agrawal

People

Assignee: Sumit Agrawal

Reporter: Sumit Agrawal

Dates

Created: 20/May/24 05:39

Updated: 30/May/24 06:24