[KUDU-2020] tserver failure causes multiple tablet copy operations per under-replicated tablet - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.3.1
Fix Version/s: 1.3.2, 1.4.0
Component/s: tserver
Labels:
None

Description

Combing through logs of a cluster recovering from tserver failure and reading the tablet copy code revealed that if a tserver's tablet copy threadpool is full (which defaults to 10 slots), and a duplicate tablet copy request is received from a leader, the tablet server will respond with a THROTTLED error instead of an ALREADY_INPROGRESS error. If this situation continues for 300 seconds, the leader will eject the replica, which has the effect of starting anothe tablet copy on another tserver. Meanwhile, the original tablet copy may still be making progress in the first follower tserver.

As a result, a single tablet can be in the process of being copied multiple times if a cluster is undergoing many tablet copies, for instance if a tserver with hundreds of tablets fails.

It's expected that a cluster which is recovering from tserver failure would have roughly equal aggregate read and write IO throughput, however because of this bug, the recovering cluster exhibited significantly more write disk throughput than read, since many tablet copies were served from cache. Attached is a representative graph. The spike around 5:07 is when fsyncs were disabled cluster wide (fsync tablet copy overhead is being tracked in ~~KUDU-1726~~).

Thanks to Adar and Todd for helping me track this down.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

screenshot-1.png
19/May/17 01:57
139 kB
Dan Burkert

Activity

People

Assignee:: Dan Burkert

Reporter:: Dan Burkert

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 19/May/17 01:56

Updated:: 06/Jun/17 06:17

Resolved:: 31/May/17 21:43