Queries were still hitting the KDC hard and eventually failed. This is the sequence of events:
- Start running a query
- Impala backends send thousands of TGS_REQ requests to the KDC
- Query would appear to make some progress
- Some connections succeed; others fail with "Timeout exceeded waiting to connect"
- Then I noticed in the logs that idle connections get killed after 65 seconds:
reactor.cc:281] Timing out connection server connection from 10.17.229.14:41804 - it has been idle for 65.0002s
- The query runs for about 2 minutes and then fails
- By then most connections have been released, since they were idle for > 65 seconds
- New queries go through the same process again and eventually fail
In order to run reliably on a large cluster, I believe we need to:
- Change the lifetime of idle connections, possibly extending it to the ticket lifetime?
- Create new connections before existing ones expire, in a staggered fashion, to avoid KDC-related failures?
The idle connection timeout is controlled by the flag rpc_default_keepalive_time_ms.
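
To make the staggering idea concrete, here is a minimal sketch of how per-connection renewal times could be jittered so that a large number of connections do not all re-authenticate against the KDC at the same moment. This is not Impala or KRPC code: the function name, its parameters, and the choice of a uniform random spread are all assumptions made for illustration. The 65 second default only mirrors the idle timeout seen in the log above.

```python
import random


def staggered_refresh_times(num_connections, lifetime_s=65.0,
                            window_fraction=0.5, seed=None):
    """Pick a renewal time (seconds after creation) for each connection.

    Hypothetical helper, for illustration only. Each connection is renewed
    at a random point in the last `window_fraction` of its lifetime, so the
    renewals (and the resulting TGS_REQ traffic) are spread over a window
    instead of arriving in one burst at expiry.
    """
    if not 0.0 < window_fraction <= 1.0:
        raise ValueError("window_fraction must be in (0, 1]")
    rng = random.Random(seed)
    earliest = lifetime_s * (1.0 - window_fraction)
    return [rng.uniform(earliest, lifetime_s) for _ in range(num_connections)]


if __name__ == "__main__":
    # Example: 5 connections with a 65 s lifetime, renewed somewhere in
    # the second half of that lifetime.
    for i, t in enumerate(staggered_refresh_times(5, seed=42)):
        print(f"connection {i}: renew after {t:.1f}s")
```

With window_fraction=0.5 and a 65 second lifetime, renewals land between 32.5 and 65 seconds after connection creation. A wider window spreads the KDC load more evenly at the cost of renewing some connections earlier than strictly needed.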