IMPALA-4135

Thrift threaded server times out waiting connections during high load

    Details

    • Docs Text:
      This JIRA fixes a TCP connection scalability issue with respect to highly concurrent workloads on an Impala cluster. It improves the connection accept rate under stress by accepting connections as soon as possible and queueing them for further processing. Because accept() is no longer blocked, connecting clients don't time out. The fix adds a new startup parameter, --accepted_cnxn_queue_depth, which controls the accept queue depth and defaults to 10000. Based on our stress testing on a 250-node cluster, this should be increased to ~100000 for clusters of that scale.
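
      On the implementation side, the parameter presumably looks like an ordinary gflags definition (a minimal sketch assuming Impala's gflags conventions; the help text here is illustrative, not copied from the patch):

      DEFINE_int32(accepted_cnxn_queue_depth, 10000,
          "(Advanced) Size of the queue of TCP connections that have been "
          "accepted but are still waiting for connection setup.");

      A large cluster would then raise it at impalad startup with --accepted_cnxn_queue_depth=100000.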

      Description

      During times of high load, Thrift's TThreadedServer can't keep up with the rate of new socket connections.

      Here's a repro:

      
      // Open 16k connections to the local backend port from a pool of 128
      // worker threads with a work queue of up to 10000 items.
      ThreadPool<int64_t> pool("group", "test", 128, 10000,
          [](int tid, const int64_t& item) {
            using Client = ThriftClient<ImpalaInternalServiceClient>;
            // Deliberately leaked so the connection stays open for the rest
            // of the test.
            Client* client = new Client("127.0.0.1", 22000, "", NULL, false);
            Status status = client->Open();
            if (status.ok()) {
              LOG(INFO) << "Socket " << item << " -> OK";
            } else {
              LOG(INFO) << "Socket " << item << " -> Failed: " << status.GetDetail();
            }
          });
      for (int i = 0; i < 1024 * 16; ++i) pool.Offer(i);
      pool.DrainAndShutdown();
      

      Somewhere between 5 and 50 connections fail on my machine with "connect(): timed out" error messages. This happens when a socket sits in the server's accept queue for too long.

      The server runs accept() in a single thread, and then does all the work of starting the server-side handler in that same thread: creating a new thread, taking a lock, creating transports, and so on.
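
      Roughly, the pre-patch serve loop has this shape (a simplified sketch of Thrift's C++ TThreadedServer, not its exact code; makeHandler() is a hypothetical helper standing in for the per-connection setup):

      // Everything after accept() runs on the same thread, so slow setup
      // delays the next accept() and the kernel's accept queue backs up.
      while (!stop_) {
        shared_ptr<TTransport> client = serverTransport_->accept();  // blocks
        shared_ptr<TTransport> input = inputTransportFactory_->getTransport(client);
        shared_ptr<TTransport> output = outputTransportFactory_->getTransport(client);
        shared_ptr<Runnable> handler = makeHandler(input, output);  // hypothetical helper
        threadFactory_->newThread(handler)->start();  // thread creation + locking
      }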

      The important thing is to move sockets from the waiting-for-accept state to the accepted state as quickly as possible. It's OK if there's some delay between a connection being accepted and its setup completing. So the easiest fix is to add a small thread pool to TThreadedServer that handles every aspect of connection setup except accept() itself, leaving the main server thread to do accept() and Offer() and nothing else. Even a single-threaded pool helps, as long as its queue is large enough to buffer spikes in connection requests.
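
      A minimal sketch of that shape, reusing Impala's ThreadPool from the repro above (SetupConnection() is a hypothetical helper standing in for transport and handler-thread creation; this is not the actual patch):

      // One worker is enough as long as the queue can absorb bursts of
      // incoming connections.
      ThreadPool<shared_ptr<TTransport>> setup_pool("server", "connection-setup",
          1 /* num_threads */, 10000 /* queue_size */,
          [this](int tid, const shared_ptr<TTransport>& client) {
            SetupConnection(client);  // hypothetical: transports, handler thread, locks
          });
      while (!stop_) {
        // The server thread now does nothing but accept() and Offer().
        setup_pool.Offer(serverTransport_->accept());
      }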

      With a prototype in place, and one thread in the thread pool, it took ~35s to accept 16k connections, a rate of ~200 accepts per second. Without the patch, that rate drops to ~60 per second. I'm not sure what limits the thread-pool solution - maybe the queue fills up, or maybe the test driver that's opening the connections is itself the bottleneck.

      This will be fixed longer term by IMPALA-2567, but the change here reduces the pain for larger clusters running more complex queries in the meantime.


          Activity

          Henry Robinson added a comment -

          More measurements:

          Test                                            #Cnxns  Time  Rate           Notes
          accept() only                                   16k     15s   ~1100 cnxns/s
          Single-threaded worker pool, 10k-item queue     16k     8s    ~2000 cnxns/s  See 1. below
          Four-thread worker pool, 10k-item queue         16k     63s   ~260 cnxns/s   See 2. below
          Single-threaded worker pool, single-item queue  16k     137s  119 cnxns/s    Simulates current behavior
          Four-thread worker pool, single-item queue      16k     130s  ~120 cnxns/s

          These results are partly counter-intuitive:

          • 1. Why is the accept()-only test slower than accept() plus a thread pool?
            The accept()-only test runs the destructor for each TSocket on the accept thread itself, because it doesn't transfer the socket to another thread to keep it alive. Transferring ownership of the TSocket elsewhere would probably improve the throughput.
          • 2. Why is a four-thread pool so much slower than a one-thread pool?
            My best guess is that there's more contention on the queue locks, both among the worker threads and with the accept thread that writes to the queue. IMPALA-4026 might improve this.

          We can see that the greatest improvement comes from having a buffer between worker and acceptor threads to allow the acceptor thread to run without impediment.

          Alan Choi added a comment -

          To achieve those rates, how much CPU resource do we need?

          Henry Robinson added a comment -

          Very little. CPU usage doesn't seem to get above ~5%.

          Henry Robinson added a comment -

          https://github.com/apache/incubator-impala/commit/a9c40595549bfb74d2b9796a0a7098361793492e
          Bo soon Park added a comment -

          @henry Could you tell me how you measured the #Cnxns and Time in the benchmark test?

          Henry Robinson added a comment -

          Bo soon Park - I added timers manually to the code.
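
          For reference, a timing harness along these lines works (an illustrative sketch wrapping the repro from the description, using std::chrono rather than whatever timers the test actually used):

          #include <chrono>

          auto start = std::chrono::steady_clock::now();
          for (int i = 0; i < 1024 * 16; ++i) pool.Offer(i);  // pool from the repro
          pool.DrainAndShutdown();
          double secs = std::chrono::duration<double>(
              std::chrono::steady_clock::now() - start).count();
          LOG(INFO) << (1024 * 16) << " connections in " << secs << "s ("
                    << (1024 * 16) / secs << " cnxns/s)";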


            People

            • Assignee: Thomas Tauber-Marshall
            • Reporter: Henry Robinson
            • Votes: 2
            • Watchers: 17
