Details
-
Task
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
0.7.0
-
None
Description
While triaging a cascading YCSB failure, we noticed the following sequence of events:
- Client deleted a table.
- Master serviced the request.
- Master issued DeleteTablet for a particular tablet to a quorum of 3 peers.
- Due to load or whatever, the followers received and processed the DeleteTablet before the leader.
- The leader noticed the the followers no longer had the tablet, and told them to remote bootstrap it from itself.
- The leader began servicing the DeleteTablet.
- The followers began remote bootstrapping, which killed the leader due to
KUDU-1328. If the leader hadn't died, the followers' remote bootstrap sessions would have failed. - There's an open question for this step: is any bad "state" left in the followers? Or do the remote bootstrap sessions abort cleanly?
Anyway, the fact that the replicas handled the DeleteTablet before the leader led to unnecessary remote bootstrap work. We should avoid this.
Note: Todd suspects that delete_table-test's flakiness may be due to this behavior. I didn't look into it, but whomever tackles this should consider that possibility.
Attachments
Issue Links
- is related to
-
KUDU-1451 Restarting a TS that had a lot of deleted tablets takes tens of minutes
- Resolved