  Kudu / KUDU-3587

Implement smarter back-off strategy for RetriableRpc upon receiving REPLICA_NOT_LEADER response


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: client
    • Labels: None

    Description

      As of Kudu 1.17.0, the implementation of RetriableRpc for WriteRpc in the C++ client uses a linear back-off strategy, where the hold-off interval (in milliseconds) is computed as

      num_attempts + (rand() % 5)
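
      To make the growth rate of this formula concrete, here is a minimal sketch of it. The helper names (LinearBackoffMs, MaxTotalWaitMs) are hypothetical and are not taken from the Kudu client code; std::mt19937 stands in for rand().

```cpp
#include <cstdint>
#include <random>

// Hypothetical helper mirroring the formula above: the delay in
// milliseconds grows by one per attempt, plus 0..4 ms of random jitter.
// `rng` stands in for rand(); the name and signature are illustrative only.
inline int64_t LinearBackoffMs(int num_attempts, std::mt19937& rng) {
  std::uniform_int_distribution<int> jitter(0, 4);  // same range as rand() % 5
  return num_attempts + jitter(rng);
}

// Upper bound on the total time (ms) spent sleeping across attempts 1..n:
// the sum of 1..n plus the maximum jitter of 4 ms per attempt.
inline int64_t MaxTotalWaitMs(int n) {
  return static_cast<int64_t>(n) * (n + 1) / 2 + 4LL * n;
}
```

      Under these assumptions, ten consecutive retries sleep for at most 95 ms in total, and a hundred retries for at most 5450 ms, so a single client can issue a large number of retries in a short period.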
      

      Kudu servers use separate incoming queues for different RPC interfaces (e.g. TabletServerService, ConsensusService, etc.). Even so, with many active clients, many tablet replicas per tablet server, and ongoing Raft election storms caused by frozen and/or slow RPC worker threads, the TabletServerService RPC queues can overflow with retried write requests to tablets whose leader replicas aren't yet established. As a result, many more unrelated write requests might be dropped from those queues. It doesn't make sense to self-inflict such a DoS condition because of a suboptimal RPC retry strategy on the client side.

      One option might be to use a linear back-off strategy while going round-robin through the recently refreshed list of tablet replicas, but switch to an exponential strategy upon completing a full circle and issuing the next GetTableLocations request to the Kudu master.
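
      A rough sketch of what such a hybrid policy could look like is below. Everything in it is an assumption made for illustration: the HybridBackoff class, its parameters (num_replicas, base_ms, max_ms), the doubling factor, and the cap do not exist in the Kudu client and are not part of this proposal's wording.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

// Hypothetical hybrid back-off policy; none of these names exist in Kudu.
class HybridBackoff {
 public:
  // num_replicas: size of the most recently refreshed replica list.
  // base_ms:      delay after the first full circle (assumed small).
  // max_ms:       cap for the exponential component.
  HybridBackoff(int num_replicas, int64_t base_ms, int64_t max_ms)
      : num_replicas_(std::max(1, num_replicas)),
        base_ms_(base_ms),
        max_ms_(max_ms) {}

  // Delay in milliseconds to wait after `num_attempts` attempts (1-based).
  int64_t NextDelayMs(int num_attempts, std::mt19937& rng) const {
    std::uniform_int_distribution<int> jitter(0, 4);
    const int completed_rounds = num_attempts / num_replicas_;
    const bool round_complete = (num_attempts % num_replicas_) == 0;
    if (!round_complete) {
      // Still walking the replica list: keep the existing linear formula.
      return num_attempts + jitter(rng);
    }
    // A full circle is done and the next step is a location refresh:
    // double the delay for every completed circle, up to the cap.
    const int shift = std::min(std::max(completed_rounds - 1, 0), 20);
    const int64_t delay = std::min(max_ms_, base_ms_ << shift);
    return delay + jitter(rng);
  }

 private:
  const int num_replicas_;
  const int64_t base_ms_;
  const int64_t max_ms_;
};
```

      With three replicas, base_ms = 100 and max_ms = 2000, attempts within a circle still wait only a few milliseconds, while the waits after the 3rd, 6th, 9th, ... attempts are about 100, 200, 400, ... ms until the cap is reached. Jitter is added after capping, so the result can exceed max_ms by up to 4 ms.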

          People

            Assignee: Unassigned
            Reporter: Alexey Serbin (aserbin)