Details
- Type: Epic
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Epic Name: Ignite 3 Disaster Recovery
Description
This epic covers issues that users may face when part of their data becomes unavailable for some reason, for example because a node is lost or part of the storage is lost.
The following definitions will be used throughout:
Local partition states. A local property of a replica, storage, state machine, etc., associated with the partition:
- Healthy
The state machine is running, everything is fine.
- Initializing
The Ignite node is online, but the corresponding Raft group has not completed its initialization yet.
- Snapshot installation
A full state transfer is taking place. Once it is finished, the partition will become healthy or catching-up. Until then, data can't be read, and log replication is also paused.
- Catching-up
The node is in the process of replicating data from the leader, and its data is slightly behind. This state can only be observed from the leader, because only the leader knows the latest committed index and the state of every peer.
- Broken
Something is wrong with the state machine. Some data might be unavailable for reading, the log can't be replicated, and this state won't change automatically without intervention.
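For illustration, the local states above could be modeled as a simple enum. The identifiers below follow this epic's wording and are assumptions, not names taken from the Ignite 3 code base.

```java
/**
 * Illustrative enumeration of the local partition states defined above.
 * Identifiers follow the wording of this epic; they are assumptions,
 * not the names used in the Ignite 3 code base.
 */
public enum LocalPartitionState {
    /** State machine is running, everything is fine. */
    HEALTHY,

    /** Ignite node is online, but the Raft group has not completed its initialization yet. */
    INITIALIZING,

    /** Full state transfer is in progress; reads and log replication are paused. */
    INSTALLING_SNAPSHOT,

    /** The peer is replicating data from the leader; observable only from the leader. */
    CATCHING_UP,

    /** The state machine is broken and requires manual intervention. */
    BROKEN
}
```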
Global partition states. A global property of a partition that specifies its apparent functionality from the user's point of view:
- Available partition
A healthy partition that can process read and write requests. This means that the majority of peers are healthy at the moment.
- Read-only partition
A partition that can process read requests, but can't process write requests. There is no healthy majority, but there is at least one alive (healthy or catching-up) peer that can process historical read-only queries.
- Unavailable partition
A partition that can't process any requests.
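As a rough sketch of how the global state relates to the local states of the peers, reusing the illustrative LocalPartitionState enum from the sketch above (all names are assumptions, not the actual Ignite 3 logic):

```java
import java.util.Collection;

/**
 * Rough sketch of deriving the global partition state from the local states of its peers.
 * Reuses the illustrative LocalPartitionState enum above; not the actual Ignite 3 code.
 */
public final class GlobalPartitionStates {
    public enum GlobalPartitionState { AVAILABLE, READ_ONLY, UNAVAILABLE }

    /**
     * @param peerStates Local states of the reachable peers of one partition.
     * @param replicas   Configured number of peers in the Raft group.
     */
    public static GlobalPartitionState globalState(Collection<LocalPartitionState> peerStates, int replicas) {
        long healthy = peerStates.stream()
                .filter(s -> s == LocalPartitionState.HEALTHY)
                .count();
        long alive = peerStates.stream()
                .filter(s -> s == LocalPartitionState.HEALTHY || s == LocalPartitionState.CATCHING_UP)
                .count();

        if (healthy > replicas / 2) {
            return GlobalPartitionState.AVAILABLE; // Healthy majority: both reads and writes work.
        }
        if (alive > 0) {
            return GlobalPartitionState.READ_ONLY; // No majority, but historical reads are still possible.
        }
        return GlobalPartitionState.UNAVAILABLE; // No peer can serve any requests.
    }

    private GlobalPartitionStates() {
    }
}
```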
Building blocks are a set of operations that can be executed by Ignite or by the user in order to improve cluster state.
Each building block must be either an automatic action with a configurable timeout (if applicable) or a documented API, with mandatory diagnostics/metrics that allow users to make decisions about these actions (a sketch of what such an API could look like follows the list below).
- Offline Ignite node is brought back online, having all recent data.
Not a disaster recovery mechanism, but worth mentioning.
A node with usable data, one that doesn't require a full state transfer, will become a peer and will participate in voting and replication, allowing the partition to be available if the majority is healthy. This is the best case for the user: they simply restart the offline nodes and the cluster continues to be operable.
- Automatic group scale-down.
Should happen when an Ignite node is offline for too long.
Not a disaster recovery mechanism, but worth mentioning.
Only happens when the majority is online, meaning that user data is safe.
- Manual partition restart.
Should be performed manually for broken peers.
- Manual group peers/learners reconfiguration.
Should be performed on a group manually, if the majority is considered permanently lost.
- Freshly re-entering the group.
Should happen when an Ignite node is returned to the group, but its partition data is missing.
- Cleaning the partition data.
If, for some reason, we know that a certain partition on a certain node is broken, we may ask Ignite to drop its data and re-enter the group empty (as stated in option 5).
Having a dedicated operation for cleaning the partition is preferable, because:
  - the partition is stored in several storages
  - not all of them have a "file per partition" storage format, not even close
  - there's also the Raft log that should be cleaned, most likely
  - maybe the Raft meta as well
- Partial truncation of the log's suffix.
This is a case of partial cleanup of partition data. This operation might be useful if we know that there's junk in the log, but the storages are not corrupted, so there's a chance to save some data. It can be replaced with "clean partition data".
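To make the manual building blocks concrete, here is a hypothetical Java interface covering them. The method names and signatures are illustrative assumptions and do not mirror the actual Ignite 3 management API.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;

/**
 * Hypothetical management API covering the manual building blocks listed above.
 * Method names and signatures are illustrative; they are not the real Ignite 3 API.
 */
public interface DisasterRecoveryOperations {
    /** Manual partition restart for a broken peer. */
    CompletableFuture<Void> restartPartition(String nodeName, String zoneName, String tableName, int partitionId);

    /** Manual peers/learners reconfiguration when the majority is considered permanently lost. */
    CompletableFuture<Void> resetPeers(String zoneName, String tableName, int partitionId, Set<String> newPeers);

    /** Drops all partition data (all storages, Raft log and, possibly, Raft meta) so the node re-enters the group empty. */
    CompletableFuture<Void> cleanPartitionData(String nodeName, String zoneName, String tableName, int partitionId);

    /** Truncates the suffix of the Raft log when only the log contains junk and the storages are intact. */
    CompletableFuture<Void> truncateLogSuffix(String nodeName, String zoneName, String tableName, int partitionId, long firstBadLogIndex);
}
```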
In order for the user to make decisions about manual operations, we must provide partition states for all partitions in all tables/zones, both global and local (one possible shape of this information is sketched below). Global states are more important, because they directly correlate with user experience.
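One possible shape for such per-partition state information, reusing the illustrative types from the sketches above (all field names are assumptions):

```java
import java.util.Map;

/**
 * Illustrative shape of the per-partition state information exposed to the user.
 * Field names are assumptions and reuse the illustrative types defined above.
 */
public record PartitionStateInfo(
        String zoneName,
        String tableName,
        int partitionId,
        GlobalPartitionStates.GlobalPartitionState globalState,
        Map<String, LocalPartitionState> localStateByNode) {
}
```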
Some states will automatically lead to "available" partitions if the system is healthy overall and we simply wait for some time; for example, we wait until a snapshot installation or a rebalance is complete. This is not considered a building block, because it's a natural artifact of the architecture.
The current list is not exhaustive; it consists of basic actions that we could implement to cover a wide range of potential issues.
Any further addition to the list of building blocks would simply refine it, potentially allowing users to recover faster, with less data being lost, or without data loss at all. All such refinements will be considered in later releases.