FLINK-33324: Add Flink-managed timeout mechanism for backend restore operation




      Hello community, I would like to share an issue our team recently faced and propose a feature to mitigate similar problems in the future.


      Our Flink streaming job encountered consecutive checkpoint failures and subsequently attempted a restart.
      The failures were caused by checkpoint timeouts in two subtasks running on the same task manager.
      During the restart, the restore operation on that same task manager also hung, leaving the job in the "initializing" state for over an hour.
      Once we noticed that the restore operation was hanging, we terminated the task manager pod, which resolved the issue.

      The sequence of events was as follows:

      1. Checkpoints timed out for subtasks on one task manager, referred to here as tm-32.
      2. The Flink job failed and initiated a restart.
      3. Restoration succeeded for 282 subtasks, but got stuck for the 2 subtasks in tm-32.
      4. Although not all Flink tasks had reached the running state, checkpoints were still being triggered, leading to consecutive checkpoint failures.
      5. These checkpoint failures appeared to be ignored and did not count toward the execution.checkpointing.tolerable-failed-checkpoints limit.
           As a result, the job remained in the initialization phase for a very long period.
      6. Once we found this, we terminated the tm-32 pod, after which the Flink job restarted successfully.


      I feel that a Flink job remaining in the initializing state indefinitely is not ideal.
      To improve resilience, I think it would help to add a timeout for the restore operation.
      If the restore operation exceeds a specified duration, an exception should be thrown, causing the job to fail.
      This way, restore-related issues can be handled in the same way as checkpoint failures.
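
      To make the proposed behavior concrete, here is a minimal sketch in plain Java. It is not the attached prototype and not an existing Flink API: the names RestoreTimeoutSketch, restoreWithTimeout, and RestoreTimeoutException are hypothetical, and the restore step is modeled as an arbitrary blocking Callable. The idea is only that a restore call that exceeds its deadline is interrupted and turned into an exception the caller can treat as a task failure.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Flink): bounds a blocking restore call with a deadline. */
public final class RestoreTimeoutSketch {

    /** Hypothetical exception type signalling that the restore exceeded its deadline. */
    public static final class RestoreTimeoutException extends Exception {
        public RestoreTimeoutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private RestoreTimeoutSketch() {}

    /**
     * Runs the given restore action on a separate thread and waits at most
     * {@code timeout} for it to finish. On timeout the worker is interrupted
     * and a RestoreTimeoutException is thrown.
     */
    public static <T> T restoreWithTimeout(Callable<T> restore, Duration timeout)
            throws Exception {
        // Daemon thread so that a restore ignoring interrupts cannot block JVM exit.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "restore-with-timeout");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(restore);
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Ask the stuck restore to stop, then surface the timeout as a failure.
            future.cancel(true);
            throw new RestoreTimeoutException(
                    "Restore did not finish within " + timeout, e);
        } catch (ExecutionException e) {
            // Rethrow the restore's own failure rather than the wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

      A caller would wrap the restore step, for example restoreWithTimeout(() -> doRestore(), Duration.ofMinutes(10)), where doRestore is a placeholder for the actual restore logic. Interruption only helps if the restore code responds to it, so a real implementation would also need to close the underlying I/O resources.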


      Just to add, I've made a basic version of this feature to see if it works as expected.
      I've attached a screenshot from the Flink UI that shows the timeout exception raised during the restore operation.
      It's just a start, but I hope it helps with our discussion.
      (I simulated network chaos using the Litmus chaos engineering tool.)


      Thank you for considering my proposal. I'm looking forward to hearing your thoughts.
      If there's agreement on this, I'd be happy to work on implementing this feature.


      Attachments:
        1. image-2023-10-20-15-16-53-324.png (224 kB)
        2. image-2023-10-20-17-42-11-504.png (429 kB)



      Assignee: Unassigned
      Reporter: dongwoo.kim