Under high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections.
Here's a repro:
Somewhere between 5 and 50 connections fail on my machine with "connect(): timed out" error messages. This happens when a socket sits in the server's accept queue for too long.
The server runs accept() in a single thread, and then does all the work of setting up the server-side handler in that same thread: creating a new thread, taking a lock, creating transports, and so on.
The important thing is to move sockets quickly from the waiting-for-accept state to the accepted state. It's ok if there's some delay between a connection being accepted and being completely set up. So the easiest fix is to add a small thread pool to TThreadedServer that handles every aspect of connection set-up except accept() itself, leaving the main server thread to do nothing but accept() and Offer(). Even a thread pool with one thread helps, as long as the pool's queue is large enough to buffer spikes in connection requests.
With a prototype in place and one thread in the thread pool, it took ~35s to accept 16k connections, a rate of ~200 accepts per second. Without the patch, that rate drops to ~60 per second. I'm not sure what limits the thread-pool solution; maybe the queue gets full, or maybe the test driver that's opening the connections has a limit.
This will be fixed longer term by IMPALA-2567, but we can reduce the pain this causes for larger clusters with more complex queries here.