[KAFKA-5116] Controller updates to ISR holds the controller lock for a very long time - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.10.1.0, 0.10.2.0
Fix Version/s: None
Component/s: controller
Labels: None

Description

Hello!

Lately, we have noticed slow (or no) results when monitoring the broker's ISR using JMX. Many of these requests appear to hang for a very long time (e.g., more than 2 minutes). After some digging, we found that in our case the controllerLock can sometimes be held for multiple minutes in the IsrChangeNotifier callback.

Inside the lock, we read from ZooKeeper once for each partition in the changeset. With a large changeset (e.g., more than 500 partitions), this operation can take a long time to complete.

In KAFKA-2406, throttling was introduced to prevent overwhelming the controller with many changesets at once. However, this does not take large changesets into consideration.

We have identified two potential remediations we'd like to discuss further:

Move the ZooKeeper request outside of the lock. The lock would then be held only for the controller update and the processing of the changeset.

Send limited changesets to ZooKeeper when calling maybePropagateIsrChanges. When dealing with lots of partitions (e.g., more than 1000), it may be useful to batch the changesets in groups of 100 rather than send the entire list to ZooKeeper at once.
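To make the batching idea concrete, here is a minimal sketch of splitting a pending changeset into groups of at most 100 partitions. It is not the Kafka implementation: the class `IsrChangeBatcher`, the method `toBatches`, and the constant `MAX_BATCH_SIZE` are hypothetical names used only for illustration, and how each batch would actually be written to ZooKeeper is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Kafka: splits a pending ISR changeset
// into bounded batches so that no single notification is very large.
final class IsrChangeBatcher {
    // Assumed batch size, taken from the "groups of 100" suggestion above.
    static final int MAX_BATCH_SIZE = 100;

    /** Returns consecutive batches, each holding at most maxBatchSize entries. */
    static <T> List<List<T>> toBatches(List<T> pending, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < pending.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, pending.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(pending.subList(start, end)));
        }
        return batches;
    }
}
```

With 250 pending partitions and a batch size of 100, this yields three batches of 100, 100, and 50 entries. Smaller batches would bound the work done per notification, at the cost of more notifications overall.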

We're happy to work on patches for either or both of these, but we are unsure how safe the two proposals are. Specifically, moving the ZooKeeper request out of the lock may be unsafe.
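As a starting point for that safety discussion, here is a rough sketch of one possible pattern for the first proposal: perform the slow reads without the lock, then re-check under the lock that controller state has not changed before applying the results. Everything here is an assumption for illustration. `ControllerLockSketch`, `stateVersion`, `invalidate`, and `applyIsrChanges` are hypothetical names, the reader function stands in for the ZooKeeper lookup, and the version check is only one way to detect staleness. It does not show that this is safe in the real controller.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

// Hypothetical model of a controller, not Kafka code.
final class ControllerLockSketch {
    private final ReentrantLock controllerLock = new ReentrantLock();
    private final Map<String, String> isrCache = new HashMap<>();
    // Assumed counter that is bumped whenever controller state is reset.
    private int stateVersion = 0;

    private int currentVersion() {
        controllerLock.lock();
        try { return stateVersion; } finally { controllerLock.unlock(); }
    }

    /** Simulates an event (e.g. a controller reset) that makes in-flight reads stale. */
    void invalidate() {
        controllerLock.lock();
        try { stateVersion++; } finally { controllerLock.unlock(); }
    }

    String cachedIsr(String partition) {
        controllerLock.lock();
        try { return isrCache.get(partition); } finally { controllerLock.unlock(); }
    }

    /** Returns true if the fetched data was applied, false if it was discarded as stale. */
    boolean applyIsrChanges(List<String> partitions, Function<String, String> reader) {
        int versionBefore = currentVersion();

        // Slow per-partition reads happen here, with the lock NOT held.
        Map<String, String> fetched = new HashMap<>();
        for (String partition : partitions) {
            fetched.put(partition, reader.apply(partition));
        }

        // Short critical section: validate, then update in-memory state.
        controllerLock.lock();
        try {
            if (stateVersion != versionBefore) {
                return false; // state changed during the reads; caller may retry
            }
            isrCache.putAll(fetched);
            return true;
        } finally {
            controllerLock.unlock();
        }
    }
}
```

In this sketch the lock is held only for the version check and the cache update, so other work that needs the lock is not blocked by the reads. The open question raised above remains: whether the real controller has a reliable equivalent of `stateVersion` to validate against.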

Holding these locks for long periods of time seems problematic - it means that broker failure won't be detected and acted upon quickly.

Issue Links

is related to: KAFKA-6469 ISR change notification queue can prevent controller from making progress (Open)

People

Assignee: Unassigned

Reporter: Justin Downing

Votes: 2

Watchers: 10

Dates

Created: 24/Apr/17 15:34

Updated: 17/Feb/19 19:10