IMPALA-4135

Thrift threaded server times out waiting connections during high load




      During times of high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections.

      Here's a repro:

      // 128 client threads hammer port 22000 with 16k connection attempts.
      ThreadPool<int64_t> pool("group", "test", 128, 10000,
          [](int tid, const int64_t& item) {
            using Client = ThriftClient<ImpalaInternalServiceClient>;
            Client* client = new Client("", 22000, "", NULL, false);
            Status status = client->Open();
            if (status.ok()) {
              LOG(INFO) << "Socket " << item << " -> OK";
            } else {
              LOG(INFO) << "Socket " << item << " -> Failed: " << status.GetDetail();
            }
            // Note: clients are never closed, so accepted connections stay open.
          });
      for (int i = 0; i < 1024 * 16; ++i) pool.Offer(i);

      Somewhere between 5 and 50 connections fail on my machine with "connect(): timed out" error messages. This happens when a socket sits in the server's accept queue for too long.

      The server runs accept() in a single thread, and then does all the work of starting the server-side handler in that same thread: creating a new thread, taking a lock, creating transports, and so on.
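
      Roughly, the serve() loop has this shape (a paraphrase for illustration, not the literal Thrift source; Task and the factory members approximate TThreadedServer's internals):

      while (!stop_) {
        // Only this call drains the kernel's accept queue.
        boost::shared_ptr<TTransport> client = serverTransport_->accept();

        // Everything below also runs on the accept thread, so no new
        // connection can be accepted until it finishes.
        boost::shared_ptr<TTransport> input = inputTransportFactory_->getTransport(client);
        boost::shared_ptr<TTransport> output = outputTransportFactory_->getTransport(client);
        boost::shared_ptr<Runnable> task(new Task(processor_, input, output));
        {
          Synchronized s(tasksMonitor_);  // take the server-wide lock
          tasks_.insert(task.get());      // register the new handler
        }
        threadFactory_->newThread(task)->start();  // spawn a thread per connection
      }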

      The important thing is to move sockets from the waiting-for-accept state to the accepted state; it's OK if there is some delay between a connection being accepted and it being completely set up. So the easiest fix is to add a small thread pool to TThreadedServer that handles every aspect of connection set-up except accept() itself, leaving the main server thread to do accept() and Offer() and nothing else (sketched below). Even a single-threaded pool helps, as long as its queue is large enough to buffer spikes in connection requests.
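
      A minimal sketch of that shape, reusing Impala's ThreadPool for illustration; SetupConnection() is a hypothetical helper that bundles all of the per-connection work from the loop above (transport creation, task registration, thread start):

      // One worker suffices if the queue can absorb connection spikes.
      ThreadPool<boost::shared_ptr<TTransport>> setup_pool(
          "thrift-server", "connection-setup", 1, 8192,
          [](int tid, const boost::shared_ptr<TTransport>& client) {
            SetupConnection(client);  // hypothetical: everything except accept()
          });

      while (!stop_) {
        // The accept thread now does accept() and Offer(), and nothing else.
        setup_pool.Offer(serverTransport_->accept());
      }

      A nice property of Impala's ThreadPool here is that Offer() blocks when the queue is full, so the worst case should degrade back to today's behaviour rather than dropping connections outright.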

      With a prototype in place and one thread in the thread pool, it took ~35s to accept all 16k connections, a rate of roughly 470 accepts per second. Without the patch, that rate drops to ~60 per second. I'm not sure what limits the thread-pool solution's throughput; maybe the queue fills up, or maybe the test driver opening the connections is itself the bottleneck.

      This will be fixed longer term by IMPALA-2567, but in the meantime we can reduce the pain this causes for larger clusters running more complex queries.


      Assignee: Thomas Tauber-Marshall
      Reporter: Henry Robinson