Kudu / KUDU-1788

Raft UpdateConsensus retry behavior on timeout is counter-productive

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: consensus
    • Labels: None
    • Target Version/s:

      Description

      In a stress test, I've seen the following counter-productive behavior:

      • a leader is trying to send operations to a replica (e.g. a 10MB batch)
      • the network is constrained due to other activity, so sending 10MB may take >1sec
      • the request times out on the client side, likely while it was still in the process of sending the batch
      • by the time the server receives it, it is likely to have already timed out while waiting in the queue; or the server processes it, only to find that all of its ops are duplicates of the previous attempt
      • the client has no idea whether the server received it or not, and thus keeps retrying the same batch (triggering the same timeout)

      This tends to be a "sticky", cascading state: after one such timeout, the follower lags further behind, so the next batch is larger (up to the configured max batch size). The client neither backs off nor increases its timeout, so it basically keeps the network pipe full of useless redundant updates.
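The missing behavior described above could be sketched as an exponential backoff that also stretches the per-request timeout on consecutive failures. This is a hypothetical illustration of the policy, with invented class and parameter names; it is not Kudu's actual consensus queue code:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical backoff policy: on each consecutive timeout, double both the
// wait before the next retry and the per-request RPC timeout, up to a fixed
// cap. Sketch of the behavior this report says is missing, not real Kudu code.
class RetryBackoff {
 public:
  RetryBackoff(int64_t base_ms, int64_t cap_ms)
      : base_ms_(base_ms), cap_ms_(cap_ms), failures_(0) {}

  // Record another consecutive UpdateConsensus timeout.
  void RecordFailure() { ++failures_; }

  // Reset after a successful round-trip.
  void RecordSuccess() { failures_ = 0; }

  // Delay before the next retry: base * 2^failures, capped. Instead of
  // keeping the pipe full of redundant updates, the leader waits this long.
  int64_t NextDelayMs() const {
    int64_t d = base_ms_;
    for (int i = 0; i < failures_ && d < cap_ms_; ++i) d *= 2;
    return std::min(d, cap_ms_);
  }

  // Grow the per-request timeout the same way, so a slow-but-alive follower
  // eventually gets long enough to receive and apply a large batch.
  int64_t NextTimeoutMs(int64_t base_timeout_ms) const {
    int64_t t = base_timeout_ms;
    for (int i = 0; i < failures_ && t < cap_ms_; ++i) t *= 2;
    return std::min(t, cap_ms_);
  }

 private:
  int64_t base_ms_;
  int64_t cap_ms_;
  int failures_;
};
```

With this kind of policy, the second bullet's >1sec transfer would no longer trigger an endless stream of identical retries: each failure both spaces out and lengthens the next attempt.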


          Activity

          tlipcon Todd Lipcon added a comment -

          If not for KUDU-699, we could probably just bump the timeout fairly significantly and be in better shape here

          tlipcon Todd Lipcon added a comment -

          One way to reproduce this semi-reliably on a cluster:

          • run a load from one host such as:
            kudu perf loadgen vd0342 -num_rows_per_thread 1000000000 -num_threads 8 -norun_scan -table_num_buckets=60 -table_num_replicas 3 
            
          • on one of the TS hosts, drop 1% of packets to the RPC port:
            sudo iptables -A INPUT -p tcp -m statistic --mode random --probability 0.010000 -m tcp --dport 7050 -j DROP 
            
          • Pause the tserver for 10 seconds or so using kill -STOP, and then kill -CONT it.
          • the leader will continue to get a bunch of timeouts and the follower (the paused node) will get a bunch of "deduplicating request" log lines.

          Bumping the RPC timeout to 30sec seemed to prevent this behavior and the lossy follower caught up much faster.

          In testing this I also figured out that Linux tracks a number of statistics for each socket, such as estimated bandwidth (based on the congestion window) and the retransmitted packet count. It might be interesting for us to surface these somewhere for easier troubleshooting.


            People

            • Assignee:
              tlipcon Todd Lipcon
              Reporter:
              tlipcon Todd Lipcon
            • Votes: 0
              Watchers: 1
