The process wide client_request_state_map_lock_ can be a highly contended lock. It is used as a mutex currently.
Changing it to a reader-writer lock and using it as a writer lock strictly only when necessary can give us some big wins with relatively less effort.
On running some tests, I've also noticed that changing it to a RW lock reduces the number of clients created for use between nodes. The reason being ReportExecStatus() that runs in the context of a thrift connection thread, waits for fairly long periods of time for the client_request_state_map_lock_ on high load systems, causing the RPC sender to be blocked on the RPC. This in turn requires other threads on the sender node to create new client connections to send RPCs instead of reusing old ones, which contribute to nodes hitting their connection limit.
So, this relatively small change can give us wins not just in terms of performance, but scalability too.
EDIT: Trying a WIP patch with RW locks on a larger cluster has showed that it works well (very well) in some cases, but regresses at other times. The reason being, we can't decide wether to starve readers or writers, and performance varies (drastically) according to the workload and the starve option. We need to come up with other ideas to address this issue (A suggestion is to have some sort of bucketed locking scheme, where we split the query_exec_state_map_ into buckets where each bucket has its own lock, thereby reducing lock contention).
EDIT 2 (10/23/17): A simple approach that's tried and tested is to shard the lock and the map it protects, so it allows for better parallel accesses to the client request states. Also, more recent perf runs have showed qs_map_lock_ to be a frequent point of contention too. So, we're accounting for that in this JIRA too.