IMPALA-4135

Thrift threaded server times out waiting connections during high load

    Details

    • Docs Text:
      This JIRA fixes a TCP connection scalability issue with respect to highly concurrent workloads on an Impala cluster. It improves the connection accept rate under stress by accepting connections as soon as possible and queueing them for further processing. Because accept() is no longer blocked, connecting clients don't time out. The fix adds a new startup parameter, --accepted_cnxn_queue_depth, which controls the accept queue depth and defaults to 10000. Based on our stress testing on a 250-node cluster, this should be increased to ~100000 for clusters of that scale.
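
      On the implementation side, the parameter presumably looks like an ordinary gflags definition (a minimal sketch assuming Impala's gflags conventions; the help text here is illustrative, not copied from the patch):

      DEFINE_int32(accepted_cnxn_queue_depth, 10000,
          "(Advanced) Size of the queue of TCP connections that have been "
          "accepted but are still waiting for connection setup.");

      A large cluster would then raise it at impalad startup with --accepted_cnxn_queue_depth=100000.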

      Description

      During times of high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections.

      Here's a repro:

      
      // Open 16k connections to the local backend port from a pool of 128
      // worker threads with a work queue of up to 10000 items.
      ThreadPool<int64_t> pool("group", "test", 128, 10000,
          [](int tid, const int64_t& item) {
            using Client = ThriftClient<ImpalaInternalServiceClient>;
            // Deliberately leaked so the connection stays open for the rest
            // of the test.
            Client* client = new Client("127.0.0.1", 22000, "", NULL, false);
            Status status = client->Open();
            if (status.ok()) {
              LOG(INFO) << "Socket " << item << " -> OK";
            } else {
              LOG(INFO) << "Socket " << item << " -> Failed: " << status.GetDetail();
            }
          });
      for (int i = 0; i < 1024 * 16; ++i) pool.Offer(i);
      pool.DrainAndShutdown();
      

      Somewhere between 5 and 50 connections fail on my machine with "connect(): timed out" error messages. This happens when a socket sits in the server's accept queue for too long.

      The server runs accept() in a single thread, and then does all the work of starting the server-side handler in that same thread: creating a new thread, taking a lock, creating transports, and so on.
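
      Roughly, the pre-patch serve loop has this shape (a simplified sketch of Thrift's C++ TThreadedServer, not its exact code; makeHandler() is a hypothetical helper standing in for the per-connection setup):

      // Everything after accept() runs on the same thread, so slow setup
      // delays the next accept() and the kernel's accept queue backs up.
      while (!stop_) {
        shared_ptr<TTransport> client = serverTransport_->accept();  // blocks
        shared_ptr<TTransport> input = inputTransportFactory_->getTransport(client);
        shared_ptr<TTransport> output = outputTransportFactory_->getTransport(client);
        shared_ptr<Runnable> handler = makeHandler(input, output);  // hypothetical helper
        threadFactory_->newThread(handler)->start();  // thread creation + locking
      }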

      The important thing is to move sockets from the waiting-for-accept state to the accepted state as quickly as possible. It's OK if there's some delay between a connection being accepted and its setup completing. So the easiest fix is to add a small thread pool to TThreadedServer that handles every aspect of connection setup except accept() itself, leaving the main server thread to do accept() and Offer() and nothing else. Even a single-threaded pool helps, as long as its queue is large enough to buffer spikes in connection requests.
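
      A minimal sketch of that shape, reusing Impala's ThreadPool from the repro above (SetupConnection() is a hypothetical helper standing in for transport and handler-thread creation; this is not the actual patch):

      // One worker is enough as long as the queue can absorb bursts of
      // incoming connections.
      ThreadPool<shared_ptr<TTransport>> setup_pool("server", "connection-setup",
          1 /* num_threads */, 10000 /* queue_size */,
          [this](int tid, const shared_ptr<TTransport>& client) {
            SetupConnection(client);  // hypothetical: transports, handler thread, locks
          });
      while (!stop_) {
        // The server thread now does nothing but accept() and Offer().
        setup_pool.Offer(serverTransport_->accept());
      }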

      With a prototype in place, and one thread in the thread pool, it took ~35s to accept 16k connections, a rate of ~200 accepts per second. Without the patch, that rate drops to ~60 per second. I'm not sure what limits the thread-pool solution - maybe the queue fills up, or maybe the test driver that's opening the connections is itself the bottleneck.

      This will be fixed longer term by IMPALA-2567, but the change here reduces the pain for larger clusters running more complex queries in the meantime.


          Activity

          Henry Robinson added a comment -

          More measurements:

          Test                                            #Cnxns  Time  Rate           Notes
          accept() only                                   16k     15s   ~1100 cnxns/s
          Single-threaded worker pool, 10k-item queue     16k     8s    ~2000 cnxns/s  See 1. below
          Four-thread worker pool, 10k-item queue         16k     63s   ~260 cnxns/s   See 2. below
          Single-threaded worker pool, single-item queue  16k     137s  119 cnxns/s    Simulates current behavior
          Four-thread worker pool, single-item queue      16k     130s  ~120 cnxns/s

          These results are partly counter-intuitive:

          • 1. Why is the accept()-only test slower than accept() plus a thread pool?
            The accept()-only test runs the destructor for each TSocket on the accept thread itself, because it doesn't transfer the socket to another thread to keep it alive. Transferring ownership of the TSocket elsewhere would probably improve the throughput.
          • 2. Why is a four-thread pool so much slower than a one-thread pool?
            My best guess is that there's more contention on the queue locks, both among the worker threads and with the accept thread that writes to the queue. IMPALA-4026 might improve this.

          We can see that the greatest improvement comes from having a buffer between worker and acceptor threads to allow the acceptor thread to run without impediment.

          Alan Choi added a comment -

          To achieve those rates, how much CPU resource do we need?

          Henry Robinson added a comment -

          Very little. CPU usage doesn't seem to get above ~5%.

          Henry Robinson added a comment -

          https://github.com/apache/incubator-impala/commit/a9c40595549bfb74d2b9796a0a7098361793492e
          Bo soon Park added a comment -

          @henry Could you tell me how you measured the #Cnxns and Time in the benchmark test?

          Henry Robinson added a comment -

          Bo soon Park - I added timers manually to the code.
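
          For reference, a timing harness along these lines works (an illustrative sketch wrapping the repro from the description, using std::chrono rather than whatever timers the test actually used):

          #include <chrono>

          auto start = std::chrono::steady_clock::now();
          for (int i = 0; i < 1024 * 16; ++i) pool.Offer(i);  // pool from the repro
          pool.DrainAndShutdown();
          double secs = std::chrono::duration<double>(
              std::chrono::steady_clock::now() - start).count();
          LOG(INFO) << (1024 * 16) << " connections in " << secs << "s ("
                    << (1024 * 16) / secs << " cnxns/s)";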


            People

            • Assignee: Thomas Tauber-Marshall
            • Reporter: Henry Robinson
            • Votes: 2
            • Watchers: 17
