  KAFKA-10229

Kafka stream dies for no apparent reason, no errors logged on client or server


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 2.4.1
    • Fix Version/s: None
    • Component/s: streams
    • Labels: None

    Description

      My broker and clients are 2.4.1, and I'm currently running a single broker. I have a Kafka stream with exactly-once processing turned on, and an uncaught exception handler defined on the client. I noticed one of my streams was lagging, and upon investigation I saw that its consumer group was empty.

      On restarting the consumers, the consumer group re-established itself, but after about 8 minutes, the group became empty again. There is nothing logged on the client side about any stream errors, despite the existence of an uncaught exception handler.
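
      Not from the issue, but for context: in 2.4.x, `KafkaStreams#setUncaughtExceptionHandler` installs a plain `Thread.UncaughtExceptionHandler` on each stream thread, so it only fires if a stream thread actually dies with a Throwable. Eviction on heartbeat expiration happens broker-side and raises no Throwable on the client, which would be consistent with the handler staying silent. A minimal plain-JDK sketch of that mechanism (the class name `HandlerDemo` is hypothetical; no Kafka dependency):

      ```java
      import java.util.concurrent.atomic.AtomicReference;

      public class HandlerDemo {
          // Runs a thread that dies with an exception and returns the message
          // captured by its uncaught exception handler.
          static String runAndCapture() throws InterruptedException {
              AtomicReference<Throwable> seen = new AtomicReference<>();
              Thread t = new Thread(() -> { throw new IllegalStateException("boom"); });
              // Analogous to what KafkaStreams#setUncaughtExceptionHandler does in
              // 2.4.x: it registers a Thread.UncaughtExceptionHandler per stream thread.
              t.setUncaughtExceptionHandler((thread, ex) -> seen.set(ex));
              t.start();
              t.join(); // join() returns after the handler has run on the dying thread
              Throwable ex = seen.get();
              return ex == null ? null : ex.getMessage();
          }

          public static void main(String[] args) throws InterruptedException {
              // The handler fires only because the thread itself died with a Throwable.
              System.out.println("handler saw: " + runAndCapture()); // prints "handler saw: boom"
          }
      }
      ```

      If a stream thread merely stalls or is evicted by the coordinator rather than throwing, this handler never runs; `KafkaStreams#setStateListener` is the hook that observes state transitions such as REBALANCING.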

      In the broker logs, about 8 minutes after the clients restart and the stream goes to the RUNNING state, I see:

      ```
      [2020-07-02 17:34:47,033] INFO [GroupCoordinator 0]: Member cis-d7fb64c95-kl9wl-1-630af77f-138e-49d1-b76a-6034801ee359 in group produs-cisFileIndexer-stream has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
      [2020-07-02 17:34:47,033] INFO [GroupCoordinator 0]: Preparing to rebalance group produs-cisFileIndexer-stream in state PreparingRebalance with old generation 228 (__consumer_offsets-3) (reason: removing member cis-d7fb64c95-kl9wl-1-630af77f-138e-49d1-b76a-6034801ee359 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
      ```

      So according to this, the consumer heartbeat has expired. I don't know why this would be: logging shows that the stream was running and processing messages normally, and then it just stopped processing anything about 4 minutes before it died, with no apparent errors or issues, and nothing logged via the uncaught exception handler.
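
      For reference (the values below are the Kafka 2.4 consumer defaults, not taken from this issue): the group coordinator evicts a member when no heartbeat arrives within the session timeout, and a consumer that stops calling `poll()` for longer than `max.poll.interval.ms` will also leave the group. The relevant client-side timing properties:

      ```properties
      # Kafka 2.4 consumer defaults (illustrative; not the reporter's actual config)
      session.timeout.ms=10000       # coordinator evicts the member if no heartbeat within this window
      heartbeat.interval.ms=3000     # how often the heartbeat thread pings the group coordinator
      max.poll.interval.ms=300000    # max delay between poll() calls before the member leaves the group
      ```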

      It doesn't appear to be related to any specific poison pill type messages: restarting the stream causes it to reprocess a bunch more messages from the backlog, and then die again approximately 8 minutes later. At the time of the last message consumed by the stream, there are no `INFO`-level or above logs either in the client or the broker, or any errors whatsoever. The stream consumption simply stops.

      There are two consumers; even if I limit consumption to only a single consumer, the same thing happens.

      The runtime environment is Kubernetes.


          People

            Assignee: Unassigned
            Reporter: Raman Gupta (rocketraman)
