Description
We encountered an issue a few time in some of our Kafka Streams application.
After an unexpected restart of our applications, some instances have not been able to resume operating.
They got stuck while trying to restore the state store of a GlobalKTable. The only way to resume operating was to manually delete their `state.dir`.
We observed the following timeline:
- After the restart of the Kafka Streams application, it tries to restore its GlobalKTable
- It seeks to the last checkpoint available on the {{state.dir}}: 382 (https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L259)
- The watermark (endOffset results) returned the offset 383
handling ListOffsetResponse response for XX. Fetched offset 383, timestamp -1
- We enter the loop: https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L279
- Then we invoked the poll(), but the poll returns nothing, so we enter: https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L306 and we crash
Global task did not make progress to restore state within 300000 ms.
- The POD restart, and we encounter the same issue until we manually delete the state.dir
Regarding the topic, by leveraging the DumpLogSegment tool, I can see:
- Offset 381 - Last business message received
- Offset 382 - Txn COMMIT (last message)
I think the real culprit is that the checkpoint is 383 instead of being 382. For information, the global topic is a transactional topic.
While experimenting with the API, it seems that the consumer.position() call is a bit tricky, after a seek() and a poll(), it seems that the position() is actually returning the seek position. After the poll() call, even if no data is returned, the position() is returning the LSO. I did an example on https://gist.github.com/Dabz/9aa0b4d1804397af6e7b6ad8cba82dcb .
Attachments
Issue Links
- is caused by
-
KAFKA-9274 Gracefully handle timeout exceptions on Kafka Streams
- Resolved
- is related to
-
KAFKA-12980 Allow consumers to return from poll when position advances due to aborted transactions
- Resolved
-
KAFKA-6607 Kafka Streams lag not zero when input topic transactional
- Resolved
- links to