Kudu / KUDU-3134

Adjust default value for --raft_heartbeat_interval

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.12.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      Users often increase `--raft_heartbeat_interval` on larger clusters or on clusters with high replica counts. This keeps the servers from flooding each other with heartbeat RPCs, which causes queue overflows and consumes too much CPU while idle. Users have raised the value to anywhere from 1.5s to as high as 10s, and we have never seen anyone complain about problems after doing so.

      Anecdotally, I recently saw a cluster with 4k tablets per tablet server using ~150% CPU while idle. Increasing `--raft_heartbeat_interval` from 500ms to 1500ms dropped the CPU usage to ~50%.
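      As a rough illustration of why the interval matters at this scale, the sketch below estimates how many idle heartbeat RPCs a single tablet server sends per second. It is a back-of-the-envelope model, not a measurement: the function name, the replication factor of 3, and the even spread of leadership across servers are all assumptions made for illustration.

```python
def idle_heartbeats_per_second(replicas_per_server, interval_ms,
                               replication_factor=3):
    """Estimate outgoing idle Raft heartbeat RPCs per second for one server.

    Assumptions (hypothetical, for illustration only):
    - leadership is spread evenly, so a server leads
      replicas_per_server / replication_factor of its tablets;
    - each leader sends one heartbeat to each of its
      (replication_factor - 1) followers once per interval.
    """
    leaders = replicas_per_server / replication_factor
    rpcs_per_interval = leaders * (replication_factor - 1)
    return rpcs_per_interval * 1000.0 / interval_ms


if __name__ == "__main__":
    # 4k tablets per tablet server, as in the anecdote above.
    for interval_ms in (500, 1500):
        rate = idle_heartbeats_per_second(4000, interval_ms)
        print(f"{interval_ms} ms interval: ~{rate:.0f} heartbeat RPCs/s sent")
```

      Under this model the RPC rate scales inversely with the interval, so tripling the interval cuts the idle heartbeat traffic to a third, which is in line with the CPU drop observed above.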

      Generally speaking, users care more about Kudu stability and scalability than about an extremely short MTTR. Additionally, our default client RPC timeout of 30s suggests that slightly longer failover/retry times are tolerable in the default case.
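      To make the MTTR trade-off concrete, the sketch below compares an approximate leader failure detection time against the 30s client RPC timeout. It assumes a follower declares the leader failed after missing a fixed number of heartbeat periods, and that this multiplier is 3 (which I believe matches the default of `--leader_failure_max_missed_heartbeat_periods`, but treat it as an assumption). It ignores election time and backoff, so real failover takes longer than these numbers.

```python
CLIENT_RPC_TIMEOUT_MS = 30_000  # default client RPC timeout mentioned above


def failure_detection_ms(heartbeat_interval_ms, max_missed_periods=3.0):
    """Approximate time before a follower considers its leader failed.

    max_missed_periods is an assumed multiplier; the real value depends
    on the server configuration. Election time is not included.
    """
    return heartbeat_interval_ms * max_missed_periods


if __name__ == "__main__":
    for interval_ms in (500, 1500, 10_000):
        detect_ms = failure_detection_ms(interval_ms)
        within = detect_ms < CLIENT_RPC_TIMEOUT_MS
        print(f"interval={interval_ms} ms, detection~{detect_ms:.0f} ms, "
              f"under client timeout: {within}")
```

      Under these assumptions a 1500ms interval still detects a failed leader well inside the client timeout, while an interval near 10s would not unless the related timeouts are adjusted as well.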

      We should consider raising the default value of `--raft_heartbeat_interval` to support larger and more efficient clusters by default. Users who need a low MTTR can always lower the value, while also adjusting other related timeouts. We may also want to consider adjusting the default `--heartbeat_interval_ms` accordingly.

      Note: Batching the RPCs as mentioned in KUDU-1973, or providing a server-to-server proxy for heartbeating, may solve these issues without adjusting the default configuration. However, adjusting the configuration is easy and has proven effective in production deployments. Additionally, adjusting the defaults along with a KUDU-1973-like approach could lead to even lower idle resource usage.

            People

              Assignee: Unassigned
              Reporter: Grant Henke (granthenke)
              Votes: 0
              Watchers: 4