Description
In ConnectDistributedTest.test_bounce, there are flaky failures that appear to follow this pattern:
- The test is parameterized for hard bounces with Incremental Cooperative Rebalancing enabled (the failure does not appear with protocol=eager)
- A source task is on a worker that will experience a hard bounce
- The source task has written records which it has not yet committed in source offsets
- The worker is hard-bounced, and the source task is lost
- Incremental Cooperative Rebalancing starts its scheduled.rebalance.max.delay.ms delay before recovering the task
- The test ends, and the connectors and Connect are stopped
- The test verifies that the sink connector has only written records that have been committed by the source connector
- This verification fails because the source offsets are stale: the topic contains uncommitted records, and the sink connector has written at least one of them.
This can be addressed by ensuring that the test waits for the rebalance delay to expire, and for the lost task to recover and commit offsets past the progress it made before the bounce.
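The wait described above could be sketched roughly as follows. This is a minimal illustration, not the actual fix: `wait_for_source_recovery`, `task_is_running`, `committed_offset`, `produced_before_bounce`, and `grace_sec` are hypothetical names introduced here, and the local `wait_until` helper is a stand-in for whatever polling utility the test framework provides.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.1, err_msg="condition not met"):
    """Poll condition() until it returns True or timeout_sec elapses."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


def wait_for_source_recovery(task_is_running, committed_offset, produced_before_bounce,
                             rebalance_delay_sec, grace_sec=60, backoff_sec=1.0):
    """Block until the lost source task is back and its committed offset has
    caught up with the progress it made before the hard bounce.

    All parameters are assumptions for this sketch:
      task_is_running        -- callable, True once the task has been reassigned
      committed_offset       -- callable, latest committed source offset
      produced_before_bounce -- highest offset written before the bounce
      rebalance_delay_sec    -- scheduled.rebalance.max.delay.ms, in seconds
      grace_sec              -- extra time for the task to restart and commit
    """
    wait_until(
        lambda: task_is_running() and committed_offset() >= produced_before_bounce,
        # The task cannot recover before the rebalance delay has expired,
        # so the timeout must be longer than that delay.
        timeout_sec=rebalance_delay_sec + grace_sec,
        backoff_sec=backoff_sec,
        err_msg="source task did not recover and commit past its pre-bounce progress",
    )
```

The test would call this before stopping the connectors, so that the final verification runs against up-to-date source offsets rather than stale ones.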
Issue Links
- relates to KAFKA-10296 Connector task reported RUNNING after hard bounce of worker (Open)
- links to