[KAFKA-10793] Race condition in FindCoordinatorFuture permanently severs connection to group coordinator - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.5.0
Fix Version/s: 2.8.0, 2.7.1, 2.6.2
Component/s: consumer, streams
Labels:
- new-consumer-threading-should-fix

Description

Almost as soon as we started actively monitoring the last-rebalance-seconds-ago metric in our Kafka Streams test environment, we noticed something odd. Every so often, one of the StreamThreads (i.e. a single Consumer instance) would appear to fall out of the group permanently, as evidenced by a monotonically increasing last-rebalance-seconds-ago. We inject artificial network failures every few hours at most, so the group rebalances quite often, yet this one consumer never rejoins. There are no other symptoms, apart from a slight drop in throughput because the remaining threads have to take over this member's work. We're confident that the problem lies in the client layer, since the logs confirm that the unhealthy consumer was still calling poll. It was also calling Consumer#committed in its main poll loop, which consistently failed with a TimeoutException.

When I attached a remote debugger to an instance experiencing this issue, the network client's connection to the group coordinator (the one that uses MAX_VALUE - node.id as the coordinator id) was in the DISCONNECTED state. For some reason it never tried to re-establish this connection, although it did successfully connect to that same broker through the "normal" connection (i.e. the one that just uses node.id).
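To make the two connection ids concrete, here is a minimal sketch of the arithmetic described above. It assumes that "MAX_VALUE" refers to Java's Integer.MAX_VALUE. The class and method names are hypothetical and are not taken from the Kafka client.

```java
public final class CoordinatorConnectionIdSketch {

    // Hypothetical helper: derives the id used for the dedicated
    // coordinator connection from the broker's node id.
    // Assumption: MAX_VALUE in the description means Integer.MAX_VALUE.
    static int coordinatorConnectionId(int nodeId) {
        return Integer.MAX_VALUE - nodeId;
    }

    public static void main(String[] args) {
        int nodeId = 3; // example broker id, chosen arbitrarily
        System.out.println("normal connection id:      " + nodeId);
        System.out.println("coordinator connection id: " + coordinatorConnectionId(nodeId));
    }
}
```

Because the two ids differ, the client tracks two separate connections to the same broker, which is consistent with one being DISCONNECTED while the other is healthy.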

The tl;dr is that the AbstractCoordinator's FindCoordinatorRequest has failed (presumably due to a disconnect), but the findCoordinatorFuture is non-null, so a new request is never sent. This shouldn't be possible, since the FindCoordinatorResponseHandler is supposed to clear the findCoordinatorFuture when the future is completed. But somehow that didn't happen, so the consumer continues to assume there's still a FindCoordinator request in flight and never even notices that it has dropped out of the group.
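The guard described above can be illustrated with a simplified, self-contained model. This is not the actual AbstractCoordinator code: the class, its methods, and the clearFuture flag are all hypothetical, and the flag only stands in for "the handler did or did not clear the future". How the clearing actually got skipped is not established in this report.

```java
import java.util.concurrent.CompletableFuture;

public class CoordinatorLookupSketch {

    // null means "no lookup in flight"; non-null suppresses new requests.
    private CompletableFuture<String> findCoordinatorFuture;
    private int requestsSent = 0;

    // Sends a new lookup only if none is believed to be in flight.
    CompletableFuture<String> lookupCoordinator() {
        if (findCoordinatorFuture == null) {
            requestsSent++; // stands in for sending a FindCoordinator request
            findCoordinatorFuture = new CompletableFuture<>();
        }
        return findCoordinatorFuture;
    }

    // Models the request failing. clearFuture=true is the intended behaviour;
    // clearFuture=false models the observed state (hypothetical stand-in).
    void onFailure(Throwable error, boolean clearFuture) {
        CompletableFuture<String> inFlight = findCoordinatorFuture;
        if (clearFuture) {
            findCoordinatorFuture = null;
        }
        if (inFlight != null) {
            inFlight.completeExceptionally(error);
        }
    }

    int requestsSent() {
        return requestsSent;
    }

    public static void main(String[] args) {
        CoordinatorLookupSketch healthy = new CoordinatorLookupSketch();
        healthy.lookupCoordinator();
        healthy.onFailure(new RuntimeException("disconnected"), true);
        healthy.lookupCoordinator(); // future was cleared, so a retry is sent
        System.out.println("healthy requests sent: " + healthy.requestsSent());

        CoordinatorLookupSketch stuck = new CoordinatorLookupSketch();
        stuck.lookupCoordinator();
        stuck.onFailure(new RuntimeException("disconnected"), false);
        stuck.lookupCoordinator(); // stale future is returned, nothing is sent
        System.out.println("stuck requests sent: " + stuck.requestsSent());
    }
}
```

In the stuck case, every later call to lookupCoordinator() returns the same already-failed future, so no further request is issued no matter how often the caller polls.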

These are the only confirmed findings so far; however, we have some guesses, which I'll leave in the comments. Note that we only noticed this because of the newly added last-rebalance-seconds-ago metric, and there's no reason to believe this bug hasn't been flying under the radar since the Consumer's inception.

Issue Links

causes

KAFKA-13563 FindCoordinatorFuture never get cleared in non-group mode( consumer#assign)

Resolved

links to

GitHub Pull Request #9671

People

Assignee: A. Sophie Blee-Goldman
Reporter: A. Sophie Blee-Goldman
Votes: 0
Watchers: 5

Dates

Created: 02/Dec/20 01:10
Updated: 11/Feb/22 07:31
Resolved: 27/Jan/21 03:08