1.9.3, 1.10.3, 1.11.6, 1.12.7, 1.13.6, 1.14.5, 1.15.3
Deadlock is reached as the result of:
- max lookahead reached for local watermark
- idle state for subtask
The lookahead prevents the RecordEmitter from emitting a new record. The idle state prevents the global watermark from being updated.
To exit this deadlock state, we need to complete the TODO here which updates the global watermark while the subtask is marked idle, which will then allow us to emit a record again as the lookahead is no longer reached.
We reached this scenario at Lyft as a result of prolonged CPU throttling on all FlinkKinesisConsumer threads for multiple minutes.
Walking through the series of events for a single subtask:
- prolonged CPU throttling occurs and no logs are seen from any FlinkKinesisConsumer thread for up to 15 minutes
- after CPU throttling the subtask is marked idle
- the subtask has reached the lookahead for its local watermark relative to the global watermark
- WatermarkSyncCallback indicates the subtask as idle and does not update the global watermark
- emitQueue fills to max
- RecordEmitter cannot emit records due to the max lookahead
- Deadlock on subtask
At this point, we had not realized what had happened and processing of all other shards/subtasks had continued for multiple days. When we finally restarted the application, we saw the following behavior:
- global watermark recalculated after all subtasks consumed data based on the last kinesis record sequence number
- global watermark moved back in time multiple days, to when the subtask was first marked idle
- the single subtask processed data while all others remained idle due to the lookahead
This would have continued until the subtask had caught up to the others and thus the global watermark is within reach of the lookahead for other subtasks.
Too difficult to repro the exact scenario.