[KAFKA-7538] Improve locking model used to update ISRs and HW - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.1.0
Fix Version/s: 2.5.0
Component/s: core
Labels: None

Description

We currently use a ReadWriteLock in Partition to guard updates to the ISR (in-sync replica set) and the high watermark (HW) of the partition. This can result in severe lock contention if there are multiple producers writing a large amount of data into a single partition.

The current locking model is:

1) read lock while appending to log on every Produce request on the request handler thread
2) write lock on leader change, updating ISRs etc. on request handler or scheduler thread
3) write lock on every replica fetch request to check if ISRs need to be updated and to update HW and ISR on the request handler thread
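
The three paths above can be sketched as follows. This is a heavily simplified illustration, not Kafka's actual `Partition` code: the class `CurrentLockingSketch` and all of its members are hypothetical names, and the ISR and HW rules are reduced to the bare minimum needed to show which lock each path takes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for Partition; names and logic are assumptions for illustration. */
public class CurrentLockingSketch {
    private final ReentrantReadWriteLock leaderIsrUpdateLock = new ReentrantReadWriteLock();
    private final int leaderId;
    private final List<String> log = new ArrayList<>();          // guarded by its own monitor
    private final Set<Integer> isr = new HashSet<>();             // guarded by leaderIsrUpdateLock
    private final Map<Integer, Long> followerLeos = new HashMap<>();
    private long highWatermark = 0L;

    public CurrentLockingSketch(int leaderId) {
        this.leaderId = leaderId;
        isr.add(leaderId);
    }

    /** 1) Produce path: read lock held for the whole (possibly slow) append. */
    public long appendAsLeader(String record) {
        leaderIsrUpdateLock.readLock().lock();
        try {
            synchronized (log) {
                log.add(record);
                return log.size(); // new log end offset
            }
        } finally {
            leaderIsrUpdateLock.readLock().unlock();
        }
    }

    /** 2) Leader change: write lock, infrequent. */
    public void makeLeader(Set<Integer> newIsr) {
        leaderIsrUpdateLock.writeLock().lock();
        try {
            isr.clear();
            isr.addAll(newIsr);
            isr.add(leaderId);
        } finally {
            leaderIsrUpdateLock.writeLock().unlock();
        }
    }

    /** 3) Replica fetch path: write lock on EVERY fetch, even if nothing changes. */
    public void updateFollowerFetchState(int replicaId, long followerLeo) {
        leaderIsrUpdateLock.writeLock().lock();
        try {
            followerLeos.put(replicaId, followerLeo);
            if (!isr.contains(replicaId) && followerLeo >= highWatermark) {
                isr.add(replicaId); // simplified "caught up" rule
            }
            long minLeo;
            synchronized (log) { minLeo = log.size(); }
            for (int id : isr) {
                if (id != leaderId) minLeo = Math.min(minLeo, followerLeos.getOrDefault(id, 0L));
            }
            highWatermark = Math.max(highWatermark, minLeo);
        } finally {
            leaderIsrUpdateLock.writeLock().unlock();
        }
    }

    public long highWatermark() {
        leaderIsrUpdateLock.readLock().lock();
        try { return highWatermark; } finally { leaderIsrUpdateLock.readLock().unlock(); }
    }

    public int isrSize() {
        leaderIsrUpdateLock.readLock().lock();
        try { return isr.size(); } finally { leaderIsrUpdateLock.readLock().unlock(); }
    }
}
```

Note that path 3) excludes all appends for its full duration, although in the common case it neither changes the ISR nor needs exclusive access to anything but the HW.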

2) is infrequent, but 1) and 3) may be frequent and can result in lock contention. If there are lots of produce requests to a partition from multiple processes, on the leader broker we may see:

- one slow log append ties up one request thread for that Produce request while holding the read lock
- (replicationFactor-1) request threads can be blocked waiting for the write lock to process replica fetch requests
- potentially several other request threads processing Produce requests may be queued up to acquire the read lock because of the waiting writers
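
The third effect, readers queuing behind waiting writers, can be reproduced in isolation. The sketch below is not Kafka code; the class `QueuedWriterDemo` and its thread names are made up for illustration. It relies on a heuristic in OpenJDK's non-fair `ReentrantReadWriteLock`: a new reader does not barge in when the thread at the head of the wait queue is a writer. Other JDK implementations may behave differently.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class QueuedWriterDemo {
    /** Returns true if a second reader got the read lock while a writer was queued. */
    public static boolean readerAcquiredWhileWriterQueued() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // non-fair by default
        CountDownLatch appendStarted = new CountDownLatch(1);
        CountDownLatch appendMayFinish = new CountDownLatch(1);

        // Plays the slow log append: holds the read lock until released by the test.
        Thread slowAppend = new Thread(() -> {
            lock.readLock().lock();
            try {
                appendStarted.countDown();
                appendMayFinish.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        // Plays a replica fetch: needs the write lock, so it queues behind the reader.
        Thread replicaFetch = new Thread(() -> {
            lock.writeLock().lock();
            lock.writeLock().unlock();
        });

        try {
            slowAppend.start();
            appendStarted.await();
            replicaFetch.start();
            // Wait until the writer is parked in the lock's wait queue.
            while (replicaFetch.getState() != Thread.State.WAITING) {
                Thread.sleep(1);
            }

            // Plays a second Produce request: the read lock is only read-held,
            // yet this timed attempt queues behind the waiting writer.
            boolean acquired = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.readLock().unlock();
            }

            appendMayFinish.countDown();
            slowAppend.join();
            replicaFetch.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second producer got read lock: " + readerAcquiredWhileWriterQueued());
    }
}
```

On OpenJDK the timed read attempt is expected to time out, which mirrors how one slow append plus one pending replica fetch can stall unrelated Produce requests on the same partition.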

In a thread dump with this issue, we noticed several request threads blocked waiting for the write lock, possibly due to replica fetch retries.

Possible fixes:

1) Process `Partition#maybeExpandIsr` on a single scheduler thread similar to `Partition#maybeShrinkIsr` so that only a single thread is blocked on the write lock. But this will delay updating ISRs and HW.
2) Change locking in `Partition#maybeExpandIsr` so that only read lock is acquired to check if ISR needs updating and write lock is acquired only to update ISRs. Also use a different lock for updating HW (perhaps just the Partition object lock) so that typical replica fetch requests complete without acquiring Partition write lock on the request handler thread.
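
A minimal sketch of option 2), assuming a simplified "caught up" rule (follower log end offset >= HW). This is not the code from the linked pull request: the class `ExpandIsrSketch` and all of its members are hypothetical. Because `ReentrantReadWriteLock` does not support upgrading a read lock to a write lock, the read lock is released first and the condition is re-checked under the write lock.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for Partition; names and rules are assumptions for illustration. */
public class ExpandIsrSketch {
    private final ReentrantReadWriteLock leaderIsrUpdateLock = new ReentrantReadWriteLock();
    private final Set<Integer> isr = new HashSet<>();   // guarded by leaderIsrUpdateLock
    private final AtomicInteger writeLockAcquisitions = new AtomicInteger();
    private long highWatermark = 0L;                    // guarded by the object monitor

    public ExpandIsrSketch(int leaderId) {
        isr.add(leaderId);
    }

    /** HW uses the object lock, so updating it never needs the write lock. */
    public synchronized boolean maybeIncrementHw(long candidate) {
        if (candidate > highWatermark) {
            highWatermark = candidate;
            return true;
        }
        return false;
    }

    public synchronized long highWatermark() {
        return highWatermark;
    }

    // Caller must hold the read or the write lock.
    private boolean needsExpand(int replicaId, long followerLeo) {
        return !isr.contains(replicaId) && followerLeo >= highWatermark();
    }

    /** Returns true if the replica was added to the ISR by this call. */
    public boolean maybeExpandIsr(int replicaId, long followerLeo) {
        boolean candidate;
        leaderIsrUpdateLock.readLock().lock();
        try {
            candidate = needsExpand(replicaId, followerLeo); // cheap check, shared lock
        } finally {
            leaderIsrUpdateLock.readLock().unlock();
        }
        if (!candidate) {
            return false; // typical fetch: no write lock taken
        }

        leaderIsrUpdateLock.writeLock().lock();
        try {
            writeLockAcquisitions.incrementAndGet();
            // Re-check: the ISR may have changed between the two lock acquisitions.
            if (!needsExpand(replicaId, followerLeo)) {
                return false;
            }
            isr.add(replicaId);
            return true;
        } finally {
            leaderIsrUpdateLock.writeLock().unlock();
        }
    }

    public int writeLockAcquisitions() {
        return writeLockAcquisitions.get();
    }
}
```

With this shape, a fetch from a replica that is already in the ISR, or that is still behind, takes only the read lock and so no longer forces concurrent appends to wait.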

I will submit a PR for 2), but other suggestions to fix this are welcome.

Issue Links

links to

GitHub Pull Request #5866

People

Assignee: Rajini Sivaram

Reporter: Rajini Sivaram

Reviewer: Jason Gustafson

Votes: 0

Watchers: 13

Dates

Created: 24/Oct/18 12:09

Updated: 02/Jun/20 07:07

Resolved: 14/Jan/20 14:52