Description
During a YCSB workload, two tservers died due to DNS resolution timeouts. For example:
F0117 09:21:14.952937 8392 raft_consensus.cc:1985] Check failed: _s.ok() Bad status: Network error: Could not obtain a remote proxy to the peer.: Unable to resolve address 've0130.halxg.cloudera.com': Name or service not known
It's not clear why this happened; perhaps table creation places an inordinate strain on DNS due to concurrent resolution load from all the bootstrapping peers.
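To make the retry question concrete, here is a minimal sketch of the kind of bounded, backed-off retry a DNS lookup could get instead of being treated as fatal. It uses raw getaddrinfo() and is purely illustrative, not Kudu's actual resolver code; which error codes are worth retrying (the crash above surfaced as "Name or service not known") is itself a judgment call.

// Hedged sketch: bounded retry with backoff around a DNS lookup, instead of
// letting a transient resolution failure kill the process. Not Kudu code.
#include <netdb.h>
#include <sys/socket.h>
#include <chrono>
#include <string>
#include <thread>

// Returns true if 'host' resolved within 'max_attempts' tries, retrying only
// on EAI_AGAIN (temporary failure in name resolution).
bool ResolveWithRetry(const std::string& host, int max_attempts) {
  addrinfo hints = {};
  hints.ai_family = AF_UNSPEC;
  hints.ai_socktype = SOCK_STREAM;
  std::chrono::milliseconds backoff(50);
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    addrinfo* result = nullptr;
    int rc = getaddrinfo(host.c_str(), nullptr, &hints, &result);
    if (rc == 0) {
      freeaddrinfo(result);
      return true;
    }
    if (rc != EAI_AGAIN) return false;  // treat as permanent; don't retry
    std::this_thread::sleep_for(backoff);
    backoff *= 2;  // back off so retries don't amplify a DNS storm
  }
  return false;
}

int main() {
  // The hostname from the crash above; 5 attempts is an arbitrary cap.
  return ResolveWithRetry("ve0130.halxg.cloudera.com", 5) ? 0 : 1;
}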
In any case, when these tservers were restarted, two tablets failed to bootstrap, both for the same reason. I'll focus on just one tablet from here on out to simplify troubleshooting:
E0117 15:35:45.567312 85124 ts_tablet_manager.cc:749] T 8c167c441a7d44b8add737d13797e694 P 7425c65d80f54f2da0a85494a5eb3e68: Tablet failed to bootstrap: Not found: Unable to load Consensus metadata: /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or directory (error 2)
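For context on what "failed to bootstrap" means here, below is a minimal sketch under the assumption that tablet metadata and consensus metadata live in separate per-tablet files, as the path in the log suggests. The helper and exact layout are illustrative, not Kudu's real bootstrap code.

// Hedged sketch of a bootstrap-time check for the state this replica is in:
// tablet-meta present, consensus-meta missing. Illustrative only.
#include <filesystem>
#include <iostream>
#include <string>

enum class BootstrapResult { kOk, kMissingConsensusMeta };

BootstrapResult CheckTabletMetadata(const std::string& tablet_id,
                                    const std::filesystem::path& data_root) {
  const auto cmeta = data_root / "consensus-meta" / tablet_id;
  if (!std::filesystem::exists(cmeta)) {
    // With no consensus metadata there is no recorded term or config, which
    // is why the later "atomic" delete request cannot be satisfied.
    std::cerr << "Tablet " << tablet_id
              << ": missing consensus metadata at " << cmeta << "\n";
    return BootstrapResult::kMissingConsensusMeta;
  }
  return BootstrapResult::kOk;
}

int main() {
  CheckTabletMetadata("8c167c441a7d44b8add737d13797e694", "/data/2/kudu");
}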
Eventually, the master decided to delete this tablet:
I0117 15:42:32.119601 85166 tablet_service.cc:672] Processing DeleteTablet for tablet 8c167c441a7d44b8add737d13797e694 with delete_type TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
I0117 15:42:32.139128 85166 tablet_service.cc:672] Processing DeleteTablet for tablet 8c167c441a7d44b8add737d13797e694 with delete_type TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
I0117 15:42:32.181843 85166 tablet_service.cc:672] Processing DeleteTablet for tablet 8c167c441a7d44b8add737d13797e694 with delete_type TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
I0117 15:42:32.276289 85166 tablet_service.cc:672] Processing DeleteTablet for tablet 8c167c441a7d44b8add737d13797e694 with delete_type TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153

As the repeated requests show, each deletion attempt failed. Annoyingly, the tserver didn't log why, but the master did:
I0117 15:42:32.117022 33903 catalog_manager.cc:2758] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 8c167c441a7d44b8add737d13797e694 on 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050) (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 29)
W0117 15:42:32.117463 33890 catalog_manager.cc:2725] TS 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): delete failed for tablet 8c167c441a7d44b8add737d13797e694 with error code TABLET_NOT_RUNNING: Illegal state: Consensus not available. Tablet shutting down
I0117 15:42:32.117491 33890 catalog_manager.cc:2522] Scheduling retry of 8c167c441a7d44b8add737d13797e694 Delete Tablet RPC for TS=7425c65d80f54f2da0a85494a5eb3e68 with a delay of 19ms (attempt = 1)...
This isn't a fatal error as far as the master is concerned, so it retries the deletion forever.
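A hedged sketch of the distinction the master could draw here: a TABLET_NOT_RUNNING response from a replica that will never reach RUNNING is effectively terminal, so retrying it forever only generates noise. The error enum, the cutoff, and the "give up" behavior are assumptions for illustration, not Kudu's catalog manager code.

// Hedged sketch: separate retryable delete failures from terminal ones
// instead of retrying in perpetuity. Illustrative only.
#include <cstdio>

enum class DeleteError { kOk, kNetworkError, kTabletNotRunning };

// Returns true if the delete should be retried; after too many consecutive
// TABLET_NOT_RUNNING responses, give up and flag the replica for operator
// attention (or a forced delete) instead of retrying forever.
bool ShouldRetryDelete(DeleteError err, int attempt, int max_not_running_attempts) {
  switch (err) {
    case DeleteError::kOk:
      return false;
    case DeleteError::kNetworkError:
      return true;  // transient: keep retrying with backoff
    case DeleteError::kTabletNotRunning:
      if (attempt < max_not_running_attempts) return true;
      std::fprintf(stderr,
                   "giving up after %d attempts; replica needs manual repair "
                   "or a forced delete\n", attempt);
      return false;
  }
  return false;
}

int main() {
  // Example: stop retrying after 100 consecutive TABLET_NOT_RUNNING replies.
  return ShouldRetryDelete(DeleteError::kTabletNotRunning, 100, 100) ? 1 : 0;
}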
Meanwhile, the broken replica of this tablet still appears to be part of the replication group. At least, that's true as far as both the master web UI and the tserver web UI are concerned. The leader tserver is logging this error repeatedly:
W0117 16:38:04.797828 81809 consensus_peers.cc:329] T 8c167c441a7d44b8add737d13797e694 P 335d132897de4bdb9b87443f2c487a42 -> Peer 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): Couldn't send request to peer 7425c65d80f54f2da0a85494a5eb3e68 for tablet 8c167c441a7d44b8add737d13797e694. Error code: TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING: FAILED: Not found: Unable to load Consensus metadata: /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or directory (error 2). Retrying in the next heartbeat period. Already tried 6666 times.
It's not clear to me exactly what state the replication group is in. The master did issue an AddServer request:
I0117 15:42:32.117065 33903 catalog_manager.cc:3069] Started AddServer task for tablet 8c167c441a7d44b8add737d13797e694
But the leader of the tablet still thinks the broken replica is in the replication group. So is this a tablet with two healthy replicas and one broken one that can't recover? Maybe.
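If re-replication really is wedged, the missing piece would be for the leader to eventually stop heartbeating a peer that has failed thousands of consecutive times (the "Already tried 6666 times" above) and propose evicting it so a replacement can be added. The sketch below shows only that bookkeeping; the types and threshold are assumptions for illustration, and Kudu's actual consensus and re-replication logic is not shown.

// Hedged sketch: track consecutive heartbeat failures per peer and signal
// when the leader should consider evicting the peer. Illustrative only.
#include <chrono>
#include <string>
#include <unordered_map>

struct PeerHealth {
  int consecutive_failures = 0;
  std::chrono::steady_clock::time_point last_success;
};

class LeaderPeerTracker {
 public:
  // Record a failed heartbeat; returns true once the peer has failed long
  // enough that a config change to evict it seems warranted.
  bool RecordFailure(const std::string& peer_uuid, int eviction_threshold) {
    PeerHealth& h = peers_[peer_uuid];
    ++h.consecutive_failures;
    return h.consecutive_failures >= eviction_threshold;
  }

  void RecordSuccess(const std::string& peer_uuid) {
    PeerHealth& h = peers_[peer_uuid];
    h.consecutive_failures = 0;
    h.last_success = std::chrono::steady_clock::now();
  }

 private:
  std::unordered_map<std::string, PeerHealth> peers_;
};

int main() {
  LeaderPeerTracker tracker;
  bool evict = false;
  for (int i = 0; i < 7000 && !evict; ++i) {
    evict = tracker.RecordFailure("7425c65d80f54f2da0a85494a5eb3e68", 1000);
  }
  return evict ? 0 : 1;
}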
So several things are broken here:
- Table creation probably created a DNS resolution storm.
- DNS resolution failures are not retried, and here they led to tserver deaths.
- On bootstrap, this replica was detected as having a tablet-meta file but no consensus-meta, and was set aside as corrupt (good). But without consensus-meta there is no consensus state, so the tserver cannot perform the "atomic delete" the master requested. Must we manually delete this replica, or should the master be able to force the issue? (See the sketch after this list.)
- The tserver did not log the tablet deletion failure.
- The master retried the deletion in perpetuity.
- Re-replication of this tablet by the leader appears to be broken.
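On the "force the issue" question in the third bullet, here is a minimal sketch of what a forced-delete escape hatch might look like: with no consensus metadata there is no committed config to check the master's compare-and-swap condition against, so either the request is rejected (today's behavior) or the caller explicitly opts out of the check. All names, fields, and the 'force' flag are assumptions for illustration, not Kudu's actual DeleteTablet handling.

// Hedged sketch: precondition check for a tombstone request against a
// replica that may have no consensus metadata at all. Illustrative only.
#include <optional>
#include <string>
#include <utility>

struct DeleteTabletRequest {
  std::string tablet_id;
  std::optional<int64_t> cas_config_opid_index;  // "atomic" delete condition
  bool force = false;                            // hypothetical override
};

struct Status {
  bool ok;
  std::string msg;
  static Status OK() { return {true, ""}; }
  static Status IllegalState(std::string m) { return {false, std::move(m)}; }
};

Status CheckDeletePreconditions(const DeleteTabletRequest& req,
                                const std::optional<int64_t>& committed_opid_index) {
  if (!committed_opid_index.has_value()) {
    // No consensus metadata on disk, as in the bootstrap failure above.
    if (req.force) return Status::OK();  // caller explicitly overrides the check
    return Status::IllegalState("Consensus not available");
  }
  if (req.cas_config_opid_index.has_value() &&
      *req.cas_config_opid_index != *committed_opid_index) {
    return Status::IllegalState("config opid_index does not match");
  }
  return Status::OK();
}

int main() {
  DeleteTabletRequest req;
  req.tablet_id = "8c167c441a7d44b8add737d13797e694";
  req.force = true;  // hypothetical override when no cmeta exists
  Status s = CheckDeletePreconditions(req, /*committed_opid_index=*/std::nullopt);
  return s.ok ? 0 : 1;
}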
I think at least some of these issues are tracked in other JIRAs.