[FLINK-7844] Fine Grained Recovery triggers checkpoint timeout failure - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.3.2, 1.4.0
Fix Version/s: 1.4.0
Component/s: Runtime / State Backends
Labels: None

Description

Context:
We are using the "individual" (fine-grained) failover recovery strategy for our embarrassingly parallel router use case. The topic has over 2000 partitions, and parallelism is set to ~180, spread across more than 20 task managers with around 180 slots.

Observations:
After one task manager terminated, the individual recovery happened correctly and the workload was re-dispatched to a newly available task manager instance. However, the checkpoint then took 10 minutes to eventually time out, leaving all other task managers unable to commit checkpoints in the meantime. In the worst case, if the job is restarted for another reason (e.g. job manager termination), more messages would be re-processed/duplicated than in the same job without fine-grained recovery enabled.

I suspect that the uber checkpoint was waiting for a previous checkpoint that was initiated by the old task manager, and therefore took a long time to time out.
Two questions:
1. Is there a configuration that controls this checkpoint timeout?
2. Is there any reason why, once the Job Manager realizes that the Task Manager is gone and the workload has been redispatched, it still needs to wait for the checkpoint initiated by the old task manager?
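
For reference, the suspected behavior can be reproduced with a toy model. The 10 minutes observed above matches Flink's default checkpoint timeout, which can be changed with `CheckpointConfig#setCheckpointTimeout`. The sketch below is not Flink's checkpoint coordinator: the class `PendingCheckpointSketch`, its methods, and the task attempt names are all hypothetical, and it assumes at most one checkpoint in flight. It only illustrates how a pending checkpoint that still awaits an acknowledgement from a lost task attempt could block further checkpoints until the timeout, and how aborting it on task failure would remove that wait.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy model of a pending checkpoint. Hypothetical; NOT Flink's CheckpointCoordinator. */
final class PendingCheckpointSketch {
    private final long triggerTimeMs;
    private final long timeoutMs;
    private final Set<String> awaitingAck; // task attempts that still have to acknowledge
    private boolean aborted;

    PendingCheckpointSketch(long triggerTimeMs, long timeoutMs, Set<String> taskAttempts) {
        this.triggerTimeMs = triggerTimeMs;
        this.timeoutMs = timeoutMs;
        this.awaitingAck = new HashSet<>(taskAttempts);
    }

    void acknowledge(String taskAttempt) {
        awaitingAck.remove(taskAttempt);
    }

    /** Assumed remedy: discard the checkpoint as soon as an awaited attempt is known to have failed. */
    void abortIfAwaiting(String failedAttempt) {
        if (awaitingAck.contains(failedAttempt)) {
            aborted = true;
        }
    }

    /** The checkpoint blocks newer ones until it is complete, aborted, or timed out. */
    boolean isBlocking(long nowMs) {
        return !aborted && !awaitingAck.isEmpty() && nowMs - triggerTimeMs < timeoutMs;
    }

    /** Simulates losing one task attempt; returns how long the checkpoint keeps blocking afterwards. */
    static long blockedAfterFailureMs(boolean abortOnFailure) {
        long timeoutMs = 10 * 60 * 1000L; // 10 minutes, as observed in this report
        long failureTimeMs = 5_000L;      // assumed: the task manager dies 5 s after the trigger
        PendingCheckpointSketch cp = new PendingCheckpointSketch(
                0L, timeoutMs, new HashSet<>(Arrays.asList("task-1#0", "task-2#0", "task-3#0")));
        cp.acknowledge("task-1#0");
        cp.acknowledge("task-2#0");
        // "task-3#0" ran on the terminated task manager and never acknowledges;
        // its replacement attempt was not part of this checkpoint.
        if (abortOnFailure) {
            cp.abortIfAwaiting("task-3#0");
        }
        long nowMs = failureTimeMs;
        while (cp.isBlocking(nowMs)) {
            nowMs += 1_000L; // advance simulated time in 1 s steps
        }
        return nowMs - failureTimeMs;
    }

    public static void main(String[] args) {
        System.out.println(blockedAfterFailureMs(false)); // prints 595000
        System.out.println(blockedAfterFailureMs(true));  // prints 0
    }
}
```

In this model, the checkpoint without abort-on-failure keeps blocking for the rest of the timeout window, while the variant that aborts on the known failure frees the next checkpoint immediately.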

Checkpoint screenshot in attachments.

Attachments

screenshot-1.png
15/Oct/17 23:26
98 kB
Zhenzhong Xu

Issue Links

relates to

FLINK-7894 Improve metrics around fine-grained recovery and associated checkpointing behaviors

Closed

links to

GitHub Pull Request #4844

People

Assignee: Till Rohrmann

Reporter: Zhenzhong Xu

Votes: 0

Watchers: 6

Dates

Created: 15/Oct/17 23:25

Updated: 27/Oct/17 17:30

Resolved: 27/Oct/17 17:30