[KAFKA-12256] auto commit causes delays due to retriable UNKNOWN_TOPIC_OR_PARTITION - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 3.2.0
Component/s: consumer
Labels:
- new-consumer-threading-should-fix

Description

In KAFKA-6829 a change was made to the consumer to internally retry commits upon receiving UNKNOWN_TOPIC_OR_PARTITION.

Though this helped mitigate issues around stale broker metadata, there were some valid concerns around the negative effects for routine topic deletion:

https://github.com/apache/kafka/pull/4948

In particular, if a commit is issued for a deleted topic, retries can block the consumer for up to max.poll.interval.ms. This is tunable of course, but any amount of stalling in a consumer can lead to unnecessary lag.
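The stall described here can be illustrated with a small simulation. This is only a sketch, not the consumer's actual code: the `Result` enum, the `blockedMs` helper, and the fixed 100 ms backoff are hypothetical names and values chosen for illustration, and time is simulated rather than slept.

```java
import java.util.function.Supplier;

public class CommitRetrySketch {
    /** Outcome of one simulated commit attempt (hypothetical, not a Kafka type). */
    enum Result { OK, UNKNOWN_TOPIC_OR_PARTITION }

    /**
     * Retries a commit while it keeps failing with the retriable error,
     * until the timeout is used up. Each failed attempt "costs" backoffMs
     * of simulated time. Returns the simulated milliseconds spent blocked.
     */
    static long blockedMs(Supplier<Result> commit, long timeoutMs, long backoffMs) {
        long elapsed = 0;
        while (elapsed < timeoutMs) {
            if (commit.get() == Result.OK) {
                return elapsed; // commit succeeded, stop retrying
            }
            elapsed += backoffMs; // back off, then try again
        }
        return elapsed; // gave up only because the timer expired
    }

    public static void main(String[] args) {
        long maxPollIntervalMs = 300_000; // 5 minutes, the default mentioned in this report
        // A deleted topic never comes back, so every attempt fails.
        long stalled = blockedMs(() -> Result.UNKNOWN_TOPIC_OR_PARTITION, maxPollIntervalMs, 100);
        System.out.println("blocked for " + stalled + " ms");
    }
}
```

Because the error is treated as retriable and the topic never reappears, the loop can only exit through the timeout, which is why the full interval is consumed.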

One of the assumptions made when the change was accepted was that commits for deleted topics should be rare in practice, since they would imply messages were being read or published at the time of deletion. It's fair to expect users not to delete topics that are actively published to. However, this assumption does not hold when auto commit is enabled.

With the current implementation of auto commit, the consumer will regularly issue commits for all topics being fetched from, regardless of whether or not messages were actually received. The fetch positions are simply flushed, even when they are 0. This is simple and generally efficient, though it does mean commits are often redundant. Besides the auto commit interval, commits are also issued at the time of rebalance, which is often precisely at the time topics are deleted.
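A rough model of this flushing behavior, assuming positions are kept in a simple map, might look like the following. The class, method names, and topic names are hypothetical and do not correspond to the consumer's internals; the point is only that nothing filters out partitions that saw no new messages.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AutoCommitFlushSketch {
    /** Builds the offsets an auto commit would send: every tracked position, unfiltered. */
    static Map<String, Long> offsetsToCommit(Map<String, Long> fetchPositions) {
        return new LinkedHashMap<>(fetchPositions);
    }

    /** Counts entries whose position equals what was already committed. */
    static long redundantEntries(Map<String, Long> toCommit, Map<String, Long> lastCommitted) {
        return toCommit.entrySet().stream()
                .filter(e -> e.getValue().equals(lastCommitted.get(e.getKey())))
                .count();
    }

    public static void main(String[] args) {
        Map<String, Long> positions = new LinkedHashMap<>();
        positions.put("audit.orders-0", 42L);
        positions.put("audit.tmp-0", 0L); // nothing consumed; the topic may since have been deleted

        Map<String, Long> lastCommitted = new LinkedHashMap<>(positions); // nothing changed since last commit

        Map<String, Long> toCommit = offsetsToCommit(positions);
        System.out.println("committing " + toCommit.keySet());
        System.out.println("redundant entries: " + redundantEntries(toCommit, lastCommitted));
    }
}
```

In this model the idle partition `audit.tmp-0` is still part of every commit, so deleting its topic is enough to trigger the retriable error even though no messages were ever read from it.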

This means that in practice commits for deleted topics are not really rare. This is particularly an issue when the consumer is subscribed to a multitude of topics using a wildcard. For example, a consumer might subscribe to a particular "flavor" of topic with the aim of auditing all such data, and these topics might dynamically come and go. The consumer's metadata and rebalance mechanisms are meant to handle this gracefully, but the end result is that such groups are often blocked in a commit for several seconds or minutes (the default max.poll.interval.ms is 5 minutes) whenever a delete occurs. This can sometimes result in significant lag.

Besides having users abandon auto commit in the face of topic deletes, there are probably multiple ways to deal with this. These include reconsidering whether commits still truly need to be retried here, or whether this behavior should be more configurable, e.g. by having a separate commit timeout or policy. In some cases the loss of a commit and subsequent message duplication is still preferred to processing delays. And having an artificially low max.poll.interval.ms or rebalance.timeout.ms comes with its own set of concerns.
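For reference, the first workaround (abandoning auto commit) comes down to consumer configuration. The sketch below only builds a `java.util.Properties` object using the standard consumer config keys; the broker address and group id are placeholders, and the class and method names are hypothetical.

```java
import java.util.Properties;

public class ConsumerConfigSketch {
    /** Consumer settings with auto commit turned off (placeholder connection values). */
    static Properties manualCommitProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder address
        props.setProperty("group.id", "audit-group");             // hypothetical group id
        props.setProperty("enable.auto.commit", "false");         // application commits explicitly
        props.setProperty("max.poll.interval.ms", "300000");      // left at the 5 minute default
        return props;
    }

    public static void main(String[] args) {
        manualCommitProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With auto commit disabled, the application decides when and what to commit, for example through `commitAsync` or a `commitSync` call with an explicit timeout. Whether that fully avoids the stall depends on how the application handles failed commits, so this should be read as one option rather than a fix.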

At the very least, the current behavior and pitfalls around deleting topics with active consumers should be documented.

Issue Links

Is contained by

KAFKA-13310 KafkaConsumer cannot jump out of the poll method, and the consumer is blocked in the ConsumerCoordinator method maybeAutoCommitOffsetsSync(Timer timer). Cpu and traffic of Broker's side increase sharply (Resolved)

Is related to

KAFKA-16235 auto commit still causes delays due to retriable UNKNOWN_TOPIC_OR_PARTITION (Open)

People

Assignee: Unassigned

Reporter: Ryan Leslie

Votes: 1

Watchers: 3

Dates

Created: 29/Jan/21 23:58

Updated: 07/Feb/24 19:06

Resolved: 11/Feb/22 17:59