While doing concurrency testing as part of the competitive benchmarking, I noticed that it is very difficult to saturate all CPUs at 100%.
Below is a snapshot from htop during a concurrency run; the state shown closely mimics steady state. Note that CPUs 41-60 are less busy than CPUs 1-20.
I then ran the command below, which dumps the threads and the processor each one is associated with (reference).
for i in $(pgrep impalad); do ps -mo pid,tid,fname,user,psr -p $i; done
From the man page for ps:
The output showed that a large number of threads are running on core 61. Not surprisingly, the ~1K threads are all thrift-server threads, so I am wondering whether this skews the kernel's ability to distribute threads evenly across the cores.
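To make the imbalance easy to see, the psr column from the ps output above can be aggregated into a threads-per-core count. This is a small sketch; the helper name and the sample data are mine, standing in for real `ps -mo pid,tid,fname,user,psr` output:

```shell
# Count threads per core: keep rows whose last column (psr) is numeric,
# tally by core, then sort by thread count descending.
count_threads_per_core() {
  awk 'NF && $NF ~ /^[0-9]+$/ { n[$NF]++ } END { for (c in n) print c, n[c] }' | sort -k2,2nr
}

# Fabricated sample rows mimicking the ps output format:
count_threads_per_core <<'EOF'
  PID   TID FNAME    USER     PSR
 1234     - impalad  impala     -
    -  1301 -        impala    61
    -  1302 -        impala    61
    -  1303 -        impala     7
EOF
# prints:
# 61 2
# 7 1
```

A lopsided count at the top of this output (e.g. core 61 here) is the kind of skew described above.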
I did a follow-up experiment by profiling different core ranges on the system:
Run 80 concurrent queries dominated by shuffle exchange
Profile cores 01-20 to foo_01-20
Profile cores 41-60 to foo_41-60
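For reference, the per-core-range profiling step above can be sketched with Linux perf's `-C/--cpu` option, which restricts sampling to a CPU list. The helper name, the zero-indexed core lists, and the 60-second window are my assumptions; the post does not say which profiler was used:

```shell
# Hypothetical helper: build (and echo, rather than run) a perf command that
# profiles only the given CPUs for a fixed window. -C restricts sampling to
# a CPU list, -g captures call graphs, -o names the output file.
profile_cores() {  # usage: profile_cores <cpu-list> <outfile> <seconds>
  echo "perf record -C $1 -g -o $2 -- sleep $3"
}

# Mirror the two runs from the experiment (assuming cores are zero-indexed):
profile_cores 0-19  foo_01-20.data 60
profile_cores 40-59 foo_41-60.data 60
```

The two resulting profiles can then be compared with `perf report` to see where the busy and idle core ranges diverge.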
Results showed that:
Cores 01-20 had 50% more instructions retired
Cores 01-20 showed significantly more contention in pthread_cond_wait, base::internal::SpinLockDelay, and __lll_lock_wait
The skew is dominated by DataStreamSender
ScannerThread(s) also showed significant skew