Details
- Improvement
- Status: Resolved
- Major
- Resolution: Won't Fix
- 1.7.0
- None
Description
There are two common situations in which tablets get stuck and can't recover automatically, characterized by the following ksck outputs:
Tombstone + eviction in-flight:
Tablet 796d3d67d6e0429fb5f91c2c7bbd486d of table 'loadgen_auto_802e774c09d74a208330db4c108a7d30' is under-replicated: 1 replica(s) not RUNNING
  16204380dc404171bebd99af2504cb14 (wdb-k015-2:7050): RUNNING
  61dec96f5aed4cd2a47814de42d721e6 (wdb-k015-3:7050): RUNNING [LEADER]
  d1689e073948415a901c64a9e9269416 (wdb-k015-1:7050): bad state
    State:       NOT_STARTED
    Data state:  TABLET_DATA_TOMBSTONED
    Last status: Tablet initializing...

2 replicas' active configs differ from the master's.
  All the peers reported by the master and tablet servers are:
  A = 16204380dc404171bebd99af2504cb14
  B = 61dec96f5aed4cd2a47814de42d721e6
  C = d1689e073948415a901c64a9e9269416

The consensus matrix is:
 Config source |         Voters         | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+------------
 master        | A   B*  C              |              |              | Yes
 A             | A   B   C              | 2            | 305          | Yes
 B             |     B   C              | 2            | 307          | No
 C             | [config not available] |              |              |
Permanently failed + eviction in-flight:
Tablet 796d3d67d6e0429fb5f91c2c7bbd486d of table 'loadgen_auto_802e774c09d74a208330db4c108a7d30' is under-replicated: 1 replica(s) not RUNNING
  16204380dc404171bebd99af2504cb14 (wdb-k015-2:7050): RUNNING
  61dec96f5aed4cd2a47814de42d721e6 (wdb-k015-3:7050): RUNNING [LEADER]
  d1689e073948415a901c64a9e9269416 (wdb-k015-1:7050): missing

2 replicas' active configs differ from the master's.
  All the peers reported by the master and tablet servers are:
  A = 16204380dc404171bebd99af2504cb14
  B = 61dec96f5aed4cd2a47814de42d721e6
  C = d1689e073948415a901c64a9e9269416

The consensus matrix is:
 Config source |         Voters         | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+------------
 master        | A   B*  C              |              |              | Yes
 A             | A   B   C              | 2            | 305          | Yes
 B             |     B   C              | 2            | 307          | No
 C             | [config not available] |              |              |
The former case is resolved by tombstoned voting (KUDU-871), while the latter is made much, much less likely by 3-4-3 replication (KUDU-1097).
However, tablets still get stuck on older versions, and it shouldn't be too hard to enhance ksck to detect and automatically fix these two situations by tablet copying B -> C and aborting the config change on B, respectively.
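The detection step can be sketched roughly as follows. This is only an illustration in Python, not ksck's actual code: the ReplicaReport type, the suggest_repair function, and the action labels "tablet_copy" and "abort_config_change" are all hypothetical names, and the input is assumed to have already been parsed out of the ksck output shown above. The idea is that the leader holds an uncommitted config that drops a voter, and that pending config cannot reach a majority because one of its remaining voters is tombstoned or missing.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class ReplicaReport:
    """Hypothetical summary of one replica, distilled from ksck output."""
    running: bool
    is_leader: bool = False
    data_state: Optional[str] = None         # e.g. "TABLET_DATA_TOMBSTONED"
    voters: Optional[FrozenSet[str]] = None  # None: config not available
    committed: Optional[bool] = None


def suggest_repair(
    master_voters: FrozenSet[str],
    reports: Dict[str, Optional[ReplicaReport]],  # None: replica is missing
) -> Optional[Tuple[str, str, Optional[str]]]:
    """Return (action, leader, target) for the two stuck patterns, else None."""
    leaders = [u for u, r in reports.items()
               if r is not None and r.running and r.is_leader]
    if len(leaders) != 1:
        return None
    leader = leaders[0]
    pending = reports[leader]
    # An eviction is in flight: the leader's uncommitted config drops a voter
    # that the master's committed config still lists.
    if pending.voters is None or pending.committed is not False:
        return None
    if not pending.voters < master_voters:
        return None
    # Voters of the pending config that are not running (or not reported).
    down = sorted(u for u in pending.voters
                  if reports.get(u) is None or not reports[u].running)
    majority = len(pending.voters) // 2 + 1
    # Not stuck if a majority is still reachable; this sketch also only
    # handles the case of exactly one unavailable voter.
    if len(pending.voters) - len(down) >= majority or len(down) != 1:
        return None
    target = down[0]
    report = reports.get(target)
    if report is None:
        # Permanently failed replica: roll back the pending config change.
        return ("abort_config_change", leader, None)
    if report.data_state == "TABLET_DATA_TOMBSTONED":
        # Tombstoned replica: copy the tablet from the leader onto it.
        return ("tablet_copy", leader, target)
    return None
```

Fed the two consensus matrices above (master voters A, B, C; leader B with pending voters B, C), this sketch would suggest a tablet copy from B to C in the tombstoned case and aborting the config change on B in the missing case. How each action would actually be carried out is left open here.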
Issue Links
- is related to: KUDU-2418 ksck should be able to auto-repair single replica tablets (with data loss) (Open)