In one of my in-memory, read-only tests (100% get requests), one of the top scalability bottlenecks was the single callQueue. A tentative change that sharded this callQueue according to the number of RPC handlers showed a big throughput improvement in a YCSB read-only scenario: the original get() throughput was around 60k QPS, and after this change plus other hotspot tuning I got 220k get() QPS on the same single region server.
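To make the sharding idea concrete, here is a minimal sketch, not the actual patch. The class name ShardedCallQueue, the round-robin dispatch policy, and the handler-to-queue mapping are all assumptions made for illustration; the point is only that each handler polls one of several smaller queues instead of every handler contending on a single one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration of a sharded call queue; not HBase code. */
public class ShardedCallQueue<T> {
    private final List<BlockingQueue<T>> queues = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    public ShardedCallQueue(int numQueues, int capacityPerQueue) {
        if (numQueues <= 0) {
            throw new IllegalArgumentException("numQueues must be positive");
        }
        for (int i = 0; i < numQueues; i++) {
            queues.add(new LinkedBlockingQueue<>(capacityPerQueue));
        }
    }

    /** Round-robin dispatch (an assumed policy); blocks if the chosen shard is full. */
    public void dispatch(T call) {
        // floorMod keeps the index valid even after the counter overflows.
        int idx = Math.floorMod(next.getAndIncrement(), queues.size());
        try {
            queues.get(idx).put(call);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while enqueueing call", e);
        }
    }

    /** Each handler thread takes calls only from its own shard. */
    public BlockingQueue<T> queueForHandler(int handlerIndex) {
        return queues.get(Math.floorMod(handlerIndex, queues.size()));
    }

    public int numQueues() {
        return queues.size();
    }
}
```

With one queue per handler, the producer side and each consumer touch different locks, which is where the contention relief would come from. The trade-off is that a slow call can delay the calls queued behind it in the same shard.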
Another thing we can do is separate the queue into a read call queue and a write call queue. We have done this in our internal branch. It would be helpful in some outages, since it prevents all-read or all-write requests from using up all of the handler threads.
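A rough sketch of the read/write split follows. It is not the code from our internal branch: the class ReadWriteCallQueues, the isRead predicate, and the separate capacities are hypothetical names chosen for illustration. The assumption is that a fixed subset of handlers drains each queue, so a flood of one request type cannot occupy every handler.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Predicate;

/** Hypothetical illustration of separate read and write call queues; not HBase code. */
public class ReadWriteCallQueues<T> {
    private final BlockingQueue<T> readQueue;
    private final BlockingQueue<T> writeQueue;
    private final Predicate<T> isRead;

    public ReadWriteCallQueues(Predicate<T> isRead, int readCapacity, int writeCapacity) {
        this.isRead = isRead;
        this.readQueue = new LinkedBlockingQueue<>(readCapacity);
        this.writeQueue = new LinkedBlockingQueue<>(writeCapacity);
    }

    /** Picks the queue a call belongs to, based on the caller-supplied predicate. */
    public BlockingQueue<T> queueFor(T call) {
        return isRead.test(call) ? readQueue : writeQueue;
    }

    /** Drained only by the handlers reserved for reads. */
    public BlockingQueue<T> readQueue() {
        return readQueue;
    }

    /** Drained only by the handlers reserved for writes. */
    public BlockingQueue<T> writeQueue() {
        return writeQueue;
    }
}
```

How many handlers to reserve for each side is a tuning decision this sketch leaves open.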
One more thing is to change the current blocking behavior once the callQueue is full. A full callQueue almost always means the backend processing is slow for some reason, so failing fast here would be more reasonable if we are using HBase as a low-latency processing system. See "callQueue.put(call)".
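The difference between blocking and failing fast can be sketched with the standard java.util.concurrent.BlockingQueue API: put() waits for free space, while offer() returns false immediately when the queue is full. The class FailFastDispatch and the exception CallQueueFullException below are hypothetical names for illustration only, and how the rejection is reported back to the client is left open.

```java
import java.util.concurrent.BlockingQueue;

/** Hypothetical illustration of fail-fast enqueueing; not HBase code. */
public class FailFastDispatch {

    /** Hypothetical exception used to signal a rejected call. */
    public static class CallQueueFullException extends RuntimeException {
        public CallQueueFullException(String message) {
            super(message);
        }
    }

    /**
     * Enqueues the call, or rejects it immediately if the queue is full.
     * offer() never blocks, unlike put(), so the caller is not stalled
     * behind a slow backend.
     */
    public static <T> void dispatchOrReject(BlockingQueue<T> callQueue, T call) {
        if (!callQueue.offer(call)) {
            throw new CallQueueFullException("call queue is full, rejecting call");
        }
    }
}
```

A client that receives the rejection can back off and retry instead of waiting on a server that is already overloaded.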