[KUDU-1337] DeleteTablet can cause spurious unfruitful remote bootstraps - ASF JIRA

Details

Type: Task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.7.0
Fix Version/s: 0.8.0
Component/s: recovery, tserver
Labels: None
Target Version/s: 0.8.0
Code Review: http://gerrit.cloudera.org:8080/#/c/2436/

Description

While triaging a cascading YCSB failure, we noticed the following sequence of events:

1. The client deleted a table.
2. The master serviced the request.
3. The master issued DeleteTablet for a particular tablet to a quorum of 3 peers.
4. Possibly due to load, the followers received and processed the DeleteTablet before the leader did.
5. The leader noticed that the followers no longer had the tablet, and told them to remote bootstrap it from the leader.
6. The leader began servicing the DeleteTablet.
7. The followers began remote bootstrapping, which killed the leader due to KUDU-1328. If the leader hadn't died, the followers' remote bootstrap sessions would have failed.
   There's an open question for this step: is any bad "state" left in the followers? Or do the remote bootstrap sessions abort cleanly?

Anyway, the fact that the replicas handled the DeleteTablet before the leader led to unnecessary remote bootstrap work. We should avoid this.
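
To make the ordering dependence concrete, here is a toy Python model of the race. It is only a sketch and not Kudu code: the replica names, the `simulate` function, and the rule "a leader that still hosts the tablet asks any follower lacking it to remote bootstrap" are simplifying assumptions drawn from the sequence above. It does not model the leader crash from KUDU-1328 or the bootstrap sessions failing later.

```python
# Toy model of the DeleteTablet ordering race (hypothetical names, not Kudu APIs).
LEADER = "leader"
FOLLOWERS = ["follower-1", "follower-2"]


def simulate(delete_order):
    """Deliver DeleteTablet to replicas in `delete_order` and return the
    followers that the leader asked to remote bootstrap along the way."""
    has_tablet = {name: True for name in [LEADER] + FOLLOWERS}
    asked_to_bootstrap = set()
    for name in delete_order:
        # The replica processes its DeleteTablet request.
        has_tablet[name] = False
        # Assumption: while the leader still hosts the tablet, it reacts to
        # any follower that is missing it by requesting a remote bootstrap.
        if has_tablet[LEADER]:
            for follower in FOLLOWERS:
                if not has_tablet[follower]:
                    asked_to_bootstrap.add(follower)
    return sorted(asked_to_bootstrap)


if __name__ == "__main__":
    # Followers process the delete first: both get a pointless bootstrap request.
    print(simulate(["follower-1", "follower-2", "leader"]))
    # Leader processes the delete first: no bootstrap requests are issued.
    print(simulate(["leader", "follower-1", "follower-2"]))
```

In this model the wasted remote bootstraps appear only when at least one follower handles the DeleteTablet before the leader does, which matches the sequence observed above.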

Note: Todd suspects that delete_table-test's flakiness may be due to this behavior. I didn't look into it, but whoever tackles this should consider that possibility.

Issue Links

is related to: KUDU-1451 Restarting a TS that had a lot of deleted tablets takes tens of minutes (Resolved)

People

Assignee: Mike Percy
Reporter: Adar Dembo
Votes: 0
Watchers: 4

Dates

Created: 17/Feb/16 00:46
Updated: 12/May/16 02:47
Resolved: 08/Mar/16 20:20