[FLINK-29099] Deadlock for Single Subtask in Kinesis Consumer - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.9.3, 1.10.3, 1.11.6, 1.12.7, 1.13.6, 1.14.5, 1.15.3
Fix Version/s: 1.17.0
Component/s: Connectors / Kinesis
Labels:

Flags:

Patch
Language:
- java

Description

Deadlock is reached as the result of:

max lookahead reached for local watermark
idle state for subtask

The lookahead prevents the RecordEmitter from emitting a new record. The idle state prevents the global watermark from being updated.

To exit this deadlock state, we need to complete the TODO here which updates the global watermark while the subtask is marked idle, which will then allow us to emit a record again as the lookahead is no longer reached.

Context:

We reached this scenario at Lyft as a result of prolonged CPU throttling on all FlinkKinesisConsumer threads for multiple minutes.

Walking through the series of events for a single subtask:

prolonged CPU throttling occurs and no logs are seen from any FlinkKinesisConsumer thread for up to 15 minutes
after CPU throttling the subtask is marked idle
the subtask has reached the lookahead for its local watermark relative to the global watermark
WatermarkSyncCallback indicates the subtask as idle and does not update the global watermark
emitQueue fills to max
RecordEmitter cannot emit records due to the max lookahead
Deadlock on subtask

At this point, we had not realized what had happened and processing of all other shards/subtasks had continued for multiple days. When we finally restarted the application, we saw the following behavior:

global watermark recalculated after all subtasks consumed data based on the last kinesis record sequence number
global watermark moved back in time multiple days, to when the subtask was first marked idle
the single subtask processed data while all others remained idle due to the lookahead

This would have continued until the subtask had caught up to the others and thus the global watermark is within reach of the lookahead for other subtasks.

Repro:

Too difficult to repro the exact scenario.

Attachments

Issue Links

links to

GitHub Pull Request #20842

GitHub Pull Request #20843

GitHub Pull Request #20844

Activity

People

Assignee:: seth saperstein

Reporter:: seth saperstein

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 24/Aug/22 19:05

Updated:: 28/Nov/22 23:37

Resolved:: 28/Nov/22 23:36

Time Tracking

Estimated:

48h

Remaining:

48h

Logged:

Not Specified