[FLINK-33324] Add flink managed timeout mechanism for backend restore operation - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Runtime / Checkpointing, Runtime / State Backends
Labels: None

Description

Hello community, I would like to share an issue our team recently faced and propose a feature to mitigate similar problems in the future.

Issue

Our Flink streaming job hit consecutive checkpoint failures and then attempted a restart.
The failures were caused by timeouts in two subtasks running on the same task manager.
The restore operation on that task manager then got stuck as well, leaving the job in the "initializing" state for over an hour.
Once we noticed that the restore operation had hung, we terminated the task manager pod, which resolved the issue.

The sequence of events was as follows:

1. Checkpoint timed out for subtasks within the task manager, referred to as tm-32.
2. The Flink job failed and initiated a restart.
3. Restoration was successful for 282 subtasks, but got stuck for the 2 subtasks in tm-32.
4. While not all Flink tasks were in the running state, checkpointing was still being triggered, leading to consecutive checkpoint failures.
5. These checkpoint failures seemed to be ignored and did not count toward the execution.checkpointing.tolerable-failed-checkpoints limit.
As a result, the job remained in the initialization phase for a very long period.
6. Once we found this, we terminated the tm-32 pod, leading to a successful Flink job restart.

Suggestion

I feel that a Flink job remaining in the initializing state indefinitely is not ideal.
To improve resilience, I think it would be helpful to add a timeout for the restore operation.
If the restore operation exceeds a specified duration, an exception should be thrown, causing the job to fail.
This way, we can address restore-related issues similarly to how we handle checkpoint failures.
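
To make the proposal concrete, here is a minimal sketch of the idea in plain Java. It is not Flink code and not the prototype mentioned in the notes: the class `RestoreTimeoutSketch`, the method `restoreWithTimeout`, and the exception `RestoreTimeoutException` are hypothetical names used only for illustration, and the `Callable` stands in for whatever blocking call performs the state backend restore. The issue does not name a configuration option for the timeout, so the duration is passed in as a plain parameter.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Flink): bounds a blocking restore call with a timeout. */
public final class RestoreTimeoutSketch {

    /** Hypothetical exception type signalling that the restore exceeded its time budget. */
    public static final class RestoreTimeoutException extends Exception {
        public RestoreTimeoutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs the given restore action on a separate thread and waits at most {@code timeout}.
     * If the action does not finish in time, the worker thread is interrupted and a
     * RestoreTimeoutException is thrown so that the caller can fail the task.
     */
    public static <T> T restoreWithTimeout(Callable<T> restore, Duration timeout) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "restore-with-timeout");
            thread.setDaemon(true); // a stuck restore thread must not keep the JVM alive
            return thread;
        });
        Future<T> future = executor.submit(restore);
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: interrupts the restore thread
            throw new RestoreTimeoutException(
                    "State restore did not finish within " + timeout, e);
        } catch (ExecutionException e) {
            // Surface the original restore failure instead of the wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would wrap the restore step, for example `restoreWithTimeout(() -> doRestore(), Duration.ofMinutes(10))`, where `doRestore` is again a placeholder. Note that cancelling the future only interrupts the worker thread, so a restore blocked in non-interruptible I/O may keep running in the background even though the caller has already failed the task.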

Notes

Just to add, I've made a basic version of this feature to see if it works as expected.
I've attached a picture from the Flink UI that shows the timeout exception raised during the restore operation.
It's just a start, but I hope it helps with our discussion.
(I simulated network chaos using the Litmus chaos engineering tool.)

Thank you for considering my proposal. I'm looking forward to hearing your thoughts.
If there's agreement on this, I'd be happy to work on implementing this feature.

Attachments

- image-2023-10-20-15-16-53-324.png (224 kB), added 20/Oct/23 06:16 by dongwoo.kim
- image-2023-10-20-17-42-11-504.png (429 kB), added 20/Oct/23 08:42 by dongwoo.kim

People

Assignee: Unassigned
Reporter: dongwoo.kim
Votes: 0
Watchers: 4

Dates

Created: 20/Oct/23 08:38
Updated: 27/Oct/23 15:48