Uploaded image for project: 'Kudu'
  1. Kudu
  2. KUDU-2201

Write RPC reports negative timeout

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.6.0
    • None
    • client, rpc
    • None
    • A machine with relatively small amount of RAM: 8GB

      Both tablet server and the kudu CLI tool were running at the same machine.

    Description

      dan@danburkert.com ran the following twice on a machine with 8GB of RAM, while running the tablet server there as well:

      $ kudu perf loadgen kudu-training-1-1.gce.cloudera.com --table-name=clicks --use-random --num-rows-per-thread=10000000
      

      The subsequent run uses the same random seed, so all 20M rows were duplicates, which caused 20M insertion errors. The OOM killer killed the tablet server, but the client also experienced some errors, which lead to bad-alloc exception and reporting of negative timeout:

      W1025 16:03:44.394037 14071 batcher.cc:349] Timed out: Failed to write batch of 506 ops to tablet 915e587ef6df47e8aae8a6c36cc37d56 after 1 attempt(s): Failed to write to server: bf76454db52c4b9f82e29a2783621439 (kudu-training-1-4.gce.cloudera.com:7050): Write RPC to 172.31.120.27:7050 timed out after -15.216s (ON_OUTBOUND_QUEUE)
      

      The tail of the error output from 'kudu perf loadgen':

      W1025 16:03:44.393877 14071 connection.cc:221] RPC call timeout handler was delayed by 18.8709s! This may be due to a process-wide pause such as swapping, logging-related delays, or allocator lock contention. Will allow an additional 0.999165s for a response.
      W1025 16:03:44.393882 14071 connection.cc:221] RPC call timeout handler was delayed by 18.8466s! This may be due to a process-wide pause such as swapping, logging-related delays, or allocator lock contention. Will allow an additional 0.999261s for a response.
      W1025 16:03:44.393888 14071 connection.cc:221] RPC call timeout handler was delayed by 18.8442s! This may be due to a process-wide pause such as swapping, logging-related delays, or allocator lock contention. Will allow an additional 0.99953s for a response.
      W1025 16:03:44.393893 14071 connection.cc:221] RPC call timeout handler was delayed by 18.8419s! This may be due to a process-wide pause such as swapping, logging-related delays, or allocator lock contention. Will allow an additional 0.999786s for a response.
      W1025 16:03:44.393898 14071 connection.cc:221] RPC call timeout handler was delayed by 18.8399s! This may be due to a process-wide pause such as swapping, logging-related delays, or allocator lock contention. Will allow an additional 0.999999s for a response.
      W1025 16:03:44.394018 14071 meta_cache.cc:217] Tablet 915e587ef6df47e8aae8a6c36cc37d56: Replica bf76454db52c4b9f82e29a2783621439 (kudu-training-1-4.gce.cloudera.com:7050) has failed: Timed out: Write RPC to 172.31.120.27:7050 timed out after -15.216s (ON_OUTBOUND_QUEUE)
      W1025 16:03:44.394037 14071 batcher.cc:349] Timed out: Failed to write batch of 506 ops to tablet 915e587ef6df47e8aae8a6c36cc37d56 after 1 attempt(s): Failed to write to server: bf76454db52c4b9f82e29a2783621439 (kudu-training-1-4.gce.cloudera.com:7050): Write RPC to 172.31.120.27:7050 timed out after -15.216s (ON_OUTBOUND_QUEUE)
      W1025 16:04:34.604554 14071 outbound_call.cc:286] RPC callback for RPC call kudu.tserver.TabletServerService.Write -> {remote=172.31.120.27:7050 (kudu-training-1-4.gce.cloudera.com), user_credentials={real_user=root}} blocked reactor thread for 65614.7us
      terminate called after throwing an instance of 'std::bad_alloc'
        what():  std::bad_alloc
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            aserbin Alexey Serbin
            Votes:
            1 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: