[FLINK-24607] SourceCoordinator may miss to close SplitEnumerator when failover frequently - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.13.3
Fix Version/s: 1.14.4, 1.15.0, 1.13.7
Component/s: Connectors / Common
Labels:
- pull-request-available

Description

We are seeing a connection leak when using the mysql-cdc [1] source. The JM log shows that many enumerators are never closed.

➜  test123 cat jobmanager.log | grep "SourceCoordinator \[\] - Restoring SplitEnumerator" | wc -l
     264
➜  test123 cat jobmanager.log | grep "SourceCoordinator \[\] - Starting split enumerator" | wc -l
     264
➜  test123 cat jobmanager.log | grep "MySqlSourceEnumerator \[\] - Starting enumerator" | wc -l
     263
➜  test123 cat jobmanager.log | grep "SourceCoordinator \[\] - Closing SourceCoordinator" | wc -l
     264
➜  test123 cat jobmanager.log | grep "MySqlSourceEnumerator \[\] - Closing enumerator" | wc -l
     195

We added a "Closing enumerator" log line to MySqlSourceEnumerator#close() and a "Starting enumerator" log line to MySqlSourceEnumerator#start(). The counts above show that SourceCoordinator was restored and closed 264 times and started its split enumerator 264 times, while MySqlSourceEnumerator logged 263 starts but only 195 closes, so at least 68 enumerators were never closed. It seems that SourceCoordinator can fail to close the enumerator when the job fails over frequently.

I also went through the code of SourceCoordinator and found a suspicious point:

The started flag and the enumerator field are assigned in the main thread, but SourceCoordinator#close() is executed asynchronously via DeferrableCoordinator#closeAsync. That means close() reads started and enumerator from a different thread. Could there be a concurrency problem here that leads to a stale read, so that close() skips closing the enumerator?
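To make the concern concrete, here is a minimal sketch of one way to publish the enumerator safely between the thread that starts it and the thread that closes it. It is not the SourceCoordinator code and not the fix from the linked pull request; SketchCoordinator, FakeEnumerator and runFailovers are hypothetical names used only for illustration. The idea is that holding the enumerator in an AtomicReference gives the asynchronous close a guaranteed view of the value written by start(), and getAndSet(null) ensures it is closed at most once.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class CoordinatorCloseSketch {

    /** Hypothetical stand-in for an enumerator that holds an external connection. */
    static final class FakeEnumerator implements AutoCloseable {
        // Number of enumerators that were created but not yet closed.
        static final AtomicInteger OPEN = new AtomicInteger();

        FakeEnumerator() {
            OPEN.incrementAndGet();
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    /** Hypothetical coordinator: start() runs on the caller thread, close runs on a pool thread. */
    static final class SketchCoordinator {
        // AtomicReference gives cross-thread visibility of the assignment made in start().
        private final AtomicReference<FakeEnumerator> enumerator = new AtomicReference<>();

        void start() {
            enumerator.set(new FakeEnumerator());
        }

        CompletableFuture<Void> closeAsync() {
            return CompletableFuture.runAsync(() -> {
                // Atomically take ownership so the enumerator is closed at most once.
                FakeEnumerator e = enumerator.getAndSet(null);
                if (e != null) {
                    e.close();
                }
            });
        }
    }

    /** Simulates repeated failovers and returns how many enumerators are still open. */
    static int runFailovers(int rounds) {
        for (int i = 0; i < rounds; i++) {
            SketchCoordinator coordinator = new SketchCoordinator();
            coordinator.start();
            coordinator.closeAsync().join();
        }
        return FakeEnumerator.OPEN.get();
    }

    public static void main(String[] args) {
        System.out.println("open enumerators after failovers: " + runFailovers(264));
    }
}
```

With plain, non-volatile fields the Java memory model does not by itself guarantee that the closing thread observes the writes made in start(); whether that is what actually happens in SourceCoordinator is exactly the open question in this report.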

I'm still not sure, because it's hard to reproduce locally, and we can't deploy a custom Flink version to the production environment.

[1]: https://github.com/ververica/flink-cdc-connectors

Attachments

jobmanager.log (4.66 MB), attached by Jark Wu, 21/Oct/21 03:33

Issue Links

links to GitHub Pull Request #18745

People

Assignee: Jiangjie Qin
Reporter: Jark Wu
Votes: 0
Watchers: 10

Dates

Created: 21/Oct/21 03:33
Updated: 24/Feb/22 01:55
Resolved: 24/Feb/22 01:55