[IMPALA-4456] Address scalability issues of qs_map_lock_ and client_request_state_map_lock_ - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: Impala 2.7.0
Fix Version/s: Impala 2.12.0
Component/s: Backend
Labels:
- performance
- scalability

Target Version:

Product Backlog

Description

The process wide client_request_state_map_lock_ can be a highly contended lock. It is used as a mutex currently.

Changing it to a reader-writer lock and using it as a writer lock strictly only when necessary can give us some big wins with relatively less effort.

On running some tests, I've also noticed that changing it to a RW lock reduces the number of clients created for use between nodes. The reason being ReportExecStatus() that runs in the context of a thrift connection thread, waits for fairly long periods of time for the client_request_state_map_lock_ on high load systems, causing the RPC sender to be blocked on the RPC. This in turn requires other threads on the sender node to create new client connections to send RPCs instead of reusing old ones, which contribute to nodes hitting their connection limit.

So, this relatively small change can give us wins not just in terms of performance, but scalability too.

EDIT: Trying a WIP patch with RW locks on a larger cluster has showed that it works well (very well) in some cases, but regresses at other times. The reason being, we can't decide wether to starve readers or writers, and performance varies (drastically) according to the workload and the starve option. We need to come up with other ideas to address this issue (A suggestion is to have some sort of bucketed locking scheme, where we split the query_exec_state_map_ into buckets where each bucket has its own lock, thereby reducing lock contention).

EDIT 2 (10/23/17): A simple approach that's tried and tested is to shard the lock and the map it protects, so it allows for better parallel accesses to the client request states. Also, more recent perf runs have showed qs_map_lock_ to be a frequent point of contention too. So, we're accounting for that in this JIRA too.

Attachments

Activity

People

Assignee:: Sailesh Mukil

Reporter:: Sailesh Mukil

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 09/Nov/16 20:32

Updated:: 16/Feb/18 01:53

Resolved:: 15/Feb/18 16:59