Kudu / KUDU-1788

Raft UpdateConsensus retry behavior on timeout is counter-productive

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: consensus
    • Labels: None
    • Target Version/s:

      Description

      In a stress test, I've seen the following counter-productive behavior:

      • a leader is trying to send operations to a replica (e.g. a 10MB batch)
      • the network is constrained due to other activity, so sending 10MB may take >1sec
      • the request times out on the client side, likely while it was still in the process of sending the batch
      • by the time the server receives it, it is likely to have already timed out while waiting in the queue; or the server processes it, only to find that all of its ops are duplicates of the previous attempt
      • the client has no idea whether the server received it or not, and thus keeps retrying the same batch (triggering the same timeout)

      This tends to be a "sticky", cascading state: after one such timeout, the follower lags further behind, so the next batch is larger (up to the configured max batch size). The client neither backs off nor increases its timeout, so it basically keeps the network pipe full of useless redundant updates.
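The missing behavior described above could be sketched as an exponential backoff that also stretches the per-request timeout on consecutive failures. This is a hypothetical illustration of the policy, with invented class and parameter names; it is not Kudu's actual consensus queue code:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical backoff policy: on each consecutive timeout, double both the
// wait before the next retry and the per-request RPC timeout, up to a fixed
// cap. Sketch of the behavior this report says is missing, not real Kudu code.
class RetryBackoff {
 public:
  RetryBackoff(int64_t base_ms, int64_t cap_ms)
      : base_ms_(base_ms), cap_ms_(cap_ms), failures_(0) {}

  // Record another consecutive UpdateConsensus timeout.
  void RecordFailure() { ++failures_; }

  // Reset after a successful round-trip.
  void RecordSuccess() { failures_ = 0; }

  // Delay before the next retry: base * 2^failures, capped. Instead of
  // keeping the pipe full of redundant updates, the leader waits this long.
  int64_t NextDelayMs() const {
    int64_t d = base_ms_;
    for (int i = 0; i < failures_ && d < cap_ms_; ++i) d *= 2;
    return std::min(d, cap_ms_);
  }

  // Grow the per-request timeout the same way, so a slow-but-alive follower
  // eventually gets long enough to receive and apply a large batch.
  int64_t NextTimeoutMs(int64_t base_timeout_ms) const {
    int64_t t = base_timeout_ms;
    for (int i = 0; i < failures_ && t < cap_ms_; ++i) t *= 2;
    return std::min(t, cap_ms_);
  }

 private:
  int64_t base_ms_;
  int64_t cap_ms_;
  int failures_;
};
```

With this kind of policy, the second bullet's >1sec transfer would no longer trigger an endless stream of identical retries: each failure both spaces out and lengthens the next attempt.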


          Activity

          tlipcon Todd Lipcon added a comment -

          If not for KUDU-699, we could probably just bump the timeout fairly significantly and be in better shape here

          tlipcon Todd Lipcon added a comment -

          One way to reproduce this semi-reliably on a cluster:

          • run a load from one host such as:
            kudu perf loadgen vd0342 -num_rows_per_thread 1000000000 -num_threads 8 -norun_scan -table_num_buckets=60 -table_num_replicas 3 
            
          • on one of the TS hosts, drop 1% of packets to the RPC port:
            sudo iptables -A INPUT -p tcp -m statistic --mode random --probability 0.010000 -m tcp --dport 7050 -j DROP 
            
          • Pause the tserver for 10 seconds or so using kill -STOP, and then kill -CONT it.
          • the leader will continue to get a bunch of timeouts and the follower (the paused node) will get a bunch of "deduplicating request" log lines.

          Bumping the RPC timeout to 30sec seemed to prevent this behavior and the lossy follower caught up much faster.

          In testing this I also figured out that Linux tracks a number of statistics for each socket, such as estimated bandwidth (based on the congestion window) and the retransmitted packet count. It might be interesting for us to surface these somewhere for easier troubleshooting.


            People

            • Assignee:
              tlipcon Todd Lipcon
              Reporter:
              tlipcon Todd Lipcon
            • Votes: 0
              Watchers: 1
