[KAFKA-12951] Infinite loop while restoring a GlobalKTable - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.7.0
Fix Version/s: 2.7.2, 2.8.1, 3.0.0
Component/s: streams
Labels:
None

Description

We encountered an issue a few time in some of our Kafka Streams application.
After an unexpected restart of our applications, some instances have not been able to resume operating.

They got stuck while trying to restore the state store of a GlobalKTable. The only way to resume operating was to manually delete their `state.dir`.

We observed the following timeline:

After the restart of the Kafka Streams application, it tries to restore its GlobalKTable
It seeks to the last checkpoint available on the {{state.dir}}: 382 (https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L259)

The watermark (endOffset results) returned the offset 383

handling ListOffsetResponse response for XX. Fetched offset 383, timestamp -1

We enter the loop: https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L279
Then we invoked the poll(), but the poll returns nothing, so we enter: https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L306 and we crash
```
Global task did not make progress to restore state within 300000 ms.
```

The POD restart, and we encounter the same issue until we manually delete the state.dir

Regarding the topic, by leveraging the DumpLogSegment tool, I can see:

Offset 381 - Last business message received
Offset 382 - Txn COMMIT (last message)

I think the real culprit is that the checkpoint is 383 instead of being 382. For information, the global topic is a transactional topic.

While experimenting with the API, it seems that the consumer.position() call is a bit tricky, after a seek() and a poll(), it seems that the position() is actually returning the seek position. After the poll() call, even if no data is returned, the position() is returning the LSO. I did an example on https://gist.github.com/Dabz/9aa0b4d1804397af6e7b6ad8cba82dcb .

Attachments

Issue Links

is caused by

KAFKA-9274 Gracefully handle timeout exceptions on Kafka Streams

Resolved

is related to

KAFKA-12980 Allow consumers to return from poll when position advances due to aborted transactions

Resolved

KAFKA-6607 Kafka Streams lag not zero when input topic transactional

Resolved

links to

GitHub Pull Request #10894

Activity

People

Assignee:: Matthias J. Sax

Reporter:: Damien Gasparina

Votes:: 1 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 15/Jun/21 15:05

Updated:: 29/Jun/21 13:30

Resolved:: 28/Jun/21 22:31