Details
Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.14.0, 1.15.0, 1.16.0, 1.17.0
None
Description
The scenario sometimes fails in TSAN builds with output like the one cited below.
It seems the root cause was RPC queue overflows at kudu-master and kudu-tserver: both spend much more time on regular requests when built with TSAN instrumentation, and resetting the client's meta-cache too often induces a flood of GetTableLocations requests. Serving those requests consumes a lot of CPU and keeps many threads busy. Since the scenario uses an internal mini-cluster (i.e. all masters and tablet servers are part of a single process), this affects the kudu-tserver RPC worker threads as well, so many requests accumulate in the RPC queues.
src/kudu/client/client-test.cc:408: Failure
Expected equality of these values:
  0
  server->server()->rpc_server()->
      service_pool("kudu.tserver.TabletServerService")->
      RpcsQueueOverflowMetric()->value()
    Which is: 1
src/kudu/client/client-test.cc:584: Failure
Expected: CheckNoRpcOverflow() doesn't generate new fatal failures in the current thread.
  Actual: it does.
src/kudu/client/client-test.cc:2466: Failure
Expected: DeleteTestRows(client_table_.get(), kLowIdx, kHighIdx) doesn't generate new fatal failures in the current thread.
  Actual: it does.
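For context, here is a minimal sketch (not the actual test code) of the kind of check that fails at client-test.cc:408: it walks the tablet servers of the internal mini-cluster and asserts that the TabletServerService RPC queue has not dropped any requests. The metric accessor chain is taken from the failure output above; the InternalMiniCluster accessors (num_tablet_servers(), mini_tablet_server()), the include paths, and the surrounding gtest usage are assumptions.

// Sketch only, assuming Kudu's InternalMiniCluster test API; the accessor
// chain below mirrors the expression shown in the failure output above.
#include <gtest/gtest.h>

#include "kudu/mini-cluster/internal_mini_cluster.h"  // assumed header path
#include "kudu/tserver/mini_tablet_server.h"          // assumed header path
#include "kudu/tserver/tablet_server.h"               // assumed header path

void CheckNoRpcOverflow(kudu::cluster::InternalMiniCluster* cluster) {
  for (int i = 0; i < cluster->num_tablet_servers(); ++i) {
    // 'server' is a MiniTabletServer; verify no RPCs were rejected because
    // the TabletServerService queue was full.
    auto* server = cluster->mini_tablet_server(i);
    ASSERT_EQ(0, server->server()->rpc_server()->
                     service_pool("kudu.tserver.TabletServerService")->
                     RpcsQueueOverflowMetric()->value());
  }
}

Under TSAN, the queue-overflow counter can become non-zero simply because request handling is slower and the GetTableLocations load described above keeps the worker threads busy, which is why such an assertion trips intermittently.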