Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- Affects Version: Impala 2.7.0
Description
During times of high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections.
Here's a repro:
ThreadPool<int64_t> pool("group", "test", 128, 10000, [](int tid, const int64_t& item) {
  using Client = ThriftClient<ImpalaInternalServiceClient>;
  Client* client = new Client("127.0.0.1", 22000, "", NULL, false);
  Status status = client->Open();
  if (status.ok()) {
    LOG(INFO) << "Socket " << item << " -> OK";
  } else {
    LOG(INFO) << "Socket " << item << " -> Failed: " << status.GetDetail();
  }
});
for (int i = 0; i < 1024 * 16; ++i) pool.Offer(i);
pool.DrainAndShutdown();
Somewhere between 5 and 50 connections fail on my machine with "connect(): timed out" error messages. This happens when a socket sits in the server's accept queue for too long.
The server runs accept() in a single thread, and then does all the work of starting the server-side handler in that same thread, which includes creating a new thread, taking a lock, creating transports, and so on.
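For illustration only, here is a minimal sketch of that serial pattern using plain POSIX sockets and std::thread rather than Thrift's actual classes; HandleConnection() is just a placeholder for the transport creation, locking, and handler startup described above:

#include <sys/socket.h>
#include <thread>
#include <unistd.h>

// Placeholder for the per-connection setup and handler work.
void HandleConnection(int client_fd) {
  // ... create transports, take locks, run the handler ...
  close(client_fd);
}

// Serial pattern: the same thread that calls accept() also does all the
// per-connection setup before it can accept the next socket.
void ServeSerially(int listen_fd) {
  while (true) {
    int client_fd = accept(listen_fd, nullptr, nullptr);
    if (client_fd < 0) continue;
    // Everything from here until the next accept() delays every connection
    // still waiting in the kernel's accept backlog, which is what produces
    // the "connect(): timed out" errors under load.
    std::thread(HandleConnection, client_fd).detach();
  }
}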
The important thing is to move sockets from waiting-for-accept to the accepted state. It's ok if there's some delay between a connection being accepted and it being completely set up. So the easiest thing to do is to add a small thread pool to the TThreadedServer that handles every aspect of connection set-up except for accept() itself, leaving the main server thread to do accept() and Offer() and nothing else (see the sketch below). Even if the thread pool has only one thread, there's a benefit as long as the thread pool's queue is large enough to buffer spikes in connection requests.
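A rough sketch of what the accept loop could look like under that scheme, reusing the ThreadPool class (and its Offer() interface) from the repro above; SetupConnection() and ServeWithSetupPool() are hypothetical names, not the actual patch:

#include <sys/socket.h>
#include <unistd.h>
// Assumes Impala's ThreadPool from util/thread-pool.h, as used in the repro.

// Placeholder for everything that currently happens inline after accept():
// creating transports, taking locks, spawning the handler thread, etc.
void SetupConnection(int client_fd) {
  // ... per-connection setup ...
  close(client_fd);
}

void ServeWithSetupPool(int listen_fd) {
  // Even a single setup thread helps, as long as the queue (10000 here) is
  // deep enough to absorb spikes of incoming connections.
  ThreadPool<int> setup_pool("server", "connection-setup", 1, 10000,
      [](int tid, const int& client_fd) { SetupConnection(client_fd); });

  while (true) {
    int client_fd = accept(listen_fd, nullptr, nullptr);
    if (client_fd < 0) continue;
    // The accept thread does nothing but accept() and Offer(); all other
    // setup work moves to the pool.
    setup_pool.Offer(client_fd);
  }
}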
With a prototype in place and one thread in the thread pool, it took ~35s to accept 16k connections, a rate of ~200 accepts per second. Without the patch, that rate drops to ~60 per second. I'm not sure what limits the thread pool solution: maybe the queue gets full, or maybe the test driver that's opening the connections has a limit.
This will be fixed longer term by IMPALA-2567, but in the meantime we can reduce the pain it causes for larger clusters running more complex queries.
Issue Links
- relates to IMPALA-4069: Introduce startup option to create and cache backend connections on startup (Resolved)