[KUDU-3134] Adjust default value for --raft_heartbeat_interval - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.12.0
Fix Version/s: None
Component/s: None
Labels: None
Target Version/s: 1.13.0

Description

Users often increase `--raft_heartbeat_interval` on larger clusters or on clusters with high replica counts. This keeps the servers from flooding each other with heartbeat RPCs, which causes queue overflows and wastes CPU while idle. Users have set values ranging from 1.5s to as high as 10s, and we have never seen anyone report problems after doing so.

Anecdotally, I recently saw a cluster with 4k tablets per tablet server using ~150% CPU while idle. Increasing `--raft_heartbeat_interval` from 500ms to 1500ms dropped CPU usage to ~50%.
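To make the scale of that anecdote concrete, here is a rough back-of-the-envelope sketch of the idle heartbeat rate for one tablet server. It assumes a replication factor of 3 and evenly spread leadership, neither of which is stated above, and the function name is purely illustrative.

```python
def idle_heartbeat_rpcs_per_sec(replicas_per_server, interval_ms,
                                replication_factor=3):
    """Rough outgoing heartbeat rate for one idle tablet server."""
    # Assumption: leadership is spread evenly, so 1/RF of the replicas
    # hosted on this server are leaders.
    leaders = replicas_per_server / replication_factor
    # Each idle leader sends one heartbeat to every follower per interval.
    rpcs_per_interval = leaders * (replication_factor - 1)
    return rpcs_per_interval * 1000.0 / interval_ms


for interval_ms in (500, 1500):
    rate = idle_heartbeat_rpcs_per_sec(4000, interval_ms)
    print(f"{interval_ms} ms -> ~{rate:.0f} heartbeat RPCs/s sent")
```

Under these assumptions the server sends roughly 5,300 heartbeat RPCs per second at 500ms and roughly 1,800 at 1500ms. The rate scales inversely with the interval, which is at least consistent with the roughly 3x drop in idle CPU observed above.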

Generally speaking, users care more about Kudu stability and scalability than about an extremely short MTTR. Additionally, our default client RPC timeout of 30s suggests that slightly longer failover/retry times are tolerable in the default case.

We should consider raising the default value of `--raft_heartbeat_interval` to support larger and more efficient clusters by default. Users who need a low MTTR can always lower the value while also adjusting other related timeouts. We may also want to consider adjusting the default `--heartbeat_interval_ms` accordingly.
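As a rough illustration of the MTTR trade-off, the sketch below estimates how long a follower waits before suspecting a dead leader. It assumes detection happens after a fixed number of missed heartbeat periods (3 here, which I believe matches the default of `--leader_failure_max_missed_heartbeat_periods`, but treat that as an assumption). It ignores election backoff and the election itself, so real failover takes longer.

```python
CLIENT_RPC_TIMEOUT_MS = 30_000  # the 30s default client timeout cited above


def approx_failure_detection_ms(interval_ms, max_missed_periods=3.0):
    """Approximate time before a follower suspects its leader has failed."""
    # Assumption: a follower gives up on the leader after this many
    # consecutive heartbeat periods pass with no heartbeat received.
    return interval_ms * max_missed_periods


for interval_ms in (500, 1500, 10_000):
    detect_ms = approx_failure_detection_ms(interval_ms)
    share = detect_ms / CLIENT_RPC_TIMEOUT_MS
    print(f"interval {interval_ms} ms: detection ~{detect_ms:.0f} ms "
          f"({share:.0%} of the client timeout)")
```

Under this simplified model a 1500ms interval still detects a failed leader well within the 30s client timeout, while a 10s interval would consume the whole timeout on detection alone.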

Note: Batching the RPCs as mentioned in KUDU-1973, or providing a server-to-server proxy for heartbeating, may solve these issues without adjusting the default configuration. However, adjusting the configuration is easy and has proven effective in production deployments. Additionally, adjusting the defaults along with a KUDU-1973-like approach could lead to even lower idle resource usage.

Issue Links

relates to:

KUDU-1973 Coalesce RPCs destined for the same server (Open)

IMPALA-11154 Idle Kudu daemons consume too much CPU (Resolved)

People

Assignee: Unassigned

Reporter: Grant Henke

Votes: 0

Watchers: 4

Dates

Created: 27/May/20 03:27

Updated: 25/Feb/22 19:51