[IMPALA-5557] Disable rpc_default_keepalive_time_ms - ASF JIRA


Queries were still hitting the KDC hard and eventually failed. This is the sequence of events:

Start running a query
Impala backends send thousands of TGS_REQ requests to the KDC
Query would appear to make some progress
Some connections succeed; others fail with "Timeout exceeded waiting to connect"
Then I noticed in the logs that idle connections get killed after 65 seconds:
reactor.cc:281] Timing out connection server connection from 10.17.229.14:41804 - it has been idle for 65.0002s
The query runs for about 2 minutes and then fails
By then most connections have been released because they were idle for more than 65 seconds
New queries go through the same process again and eventually fail

To run reliably on a large cluster, I believe we need to:

Change the lifetime of idle connections, possibly extending it to the ticket lifetime?
Create new connections before existing ones expire, in a staggered fashion, to avoid KDC-related failures?

The idle connection lifetime is controlled by the flag rpc_default_keepalive_time_ms.
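
To illustrate the staggering idea in the second proposal, here is a minimal Python sketch. It is not Impala or KRPC code: the function name, the safety margin, and the choice to spread refreshes over the second half of the idle window are all assumptions made for illustration. It only shows how refresh times could be spaced so that reconnections, and the TGS_REQ traffic they trigger, do not all reach the KDC at the same moment.

```python
def staggered_refresh_offsets(num_connections, idle_timeout_s, safety_margin_s=5.0):
    """Return one refresh offset per connection, in seconds of idle time.

    Hypothetical helper: each offset falls before the idle timeout minus a
    safety margin, and the offsets are evenly spaced so that connections are
    re-established one after another instead of all at once.
    """
    if num_connections <= 0:
        return []
    deadline = idle_timeout_s - safety_margin_s
    if deadline <= 0:
        raise ValueError("safety margin must be smaller than the idle timeout")
    # Assumption: start refreshing halfway through the usable idle window.
    window_start = deadline / 2.0
    step = (deadline - window_start) / num_connections
    return [window_start + i * step for i in range(num_connections)]


# Example: four idle connections with the 65-second timeout seen in the log.
offsets = staggered_refresh_offsets(4, 65.0)
print(offsets)
```

With a 65-second idle timeout and a 5-second margin, every refresh lands before the 60-second mark, and no two connections are refreshed at the same instant. A real implementation would probably also add random jitter per host so that different backends do not line up with each other.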

duplicates

IMPALA-5901 KRPC: Client connection negotiation failed

relates to

KUDU-279 RPC fails if it is sent exactly as the keepalive timeout expires

KUDU-2237 Allows idle server connection detection to be disabled