[KAFKA-7610] Detect consumer failures in initial JoinGroup - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.2.0, 2.1.1
Component/s: consumer
Labels:
None

Description

The session timeout and heartbeating logic in the consumer allow us to detect failures after a consumer joins the group. However, we have no mechanism to detect failures during a consumer's initial JoinGroup when its memberId is empty. When a client fails (e.g. due to a disconnect), the newly created MemberMetadata will be left in the group metadata cache. Typically when this happens, the client simply retries the JoinGroup. Every retry results in a new dangling member created and left in the group. These members are doomed to a session timeout when the group finally finishes the rebalance, but before that time, they are occupying memory. In extreme cases, when a rebalance is delayed (possibly due to a buggy application), this cycle can repeat and the cache can grow quite large.

There are a couple options that come to mind to fix the problem:

1. During the initial JoinGroup, we can detect failed members when the TCP connection fails. This is difficult at the moment because we do not have a mechanism to propagate disconnects from the network layer. A potential option is to treat the disconnect as just another type of request and pass it to the handlers through the request queue.

2. Rather than holding the JoinGroup in purgatory for an indefinite amount of time, we can return earlier with the generated memberId and an error code (say REBALANCE_IN_PROGRESS) to indicate that retry is needed to complete the rebalance. The consumer can then poll for the rebalance using its assigned memberId. And we can detect failures through the session timeout. Obviously this option requires a KIP (and some more thought).

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

pom.properties
29/Dec/18 07:46
0.1 kB
Chris Bogan

Issue Links

links to

GitHub Pull Request #5962

mentioned in: Page Loading...; Page Loading...; Page Loading...

Sub-Tasks

There are no Sub-Tasks for this issue.

Activity

People

Assignee:: Jason Gustafson

Reporter:: Jason Gustafson

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 10/Nov/18 07:39

Updated:: 08/Jan/19 21:05

Resolved:: 11/Dec/18 00:11