Flink / FLINK-17127

Make pod creating retry interval configurable


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed

    Description

      This is a follow-up to the discussion in this PR [1].

      In the current implementation, POD_CREATION_RETRY_INTERVAL is fixed at "3s", which means that when creating a TaskManager pod fails, a retry is scheduled after a 3s delay. This works for most cases. However, there is still a risk that too many retries from different Flink clusters will flood the Kubernetes API server. So we need to add initial and max settings for the retry interval, similar to NETWORK_REQUEST_BACKOFF_INITIAL/NETWORK_REQUEST_BACKOFF_MAX.

       

      We could add an ExponentialBackoff for the retry policy. The backoff could be reset to its initial value once a new TaskManager is created successfully after several retries.
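
      The retry policy described above could look roughly like the following. This is only a minimal sketch, not the code that was merged for this ticket: the class name PodCreationBackoff and its methods are hypothetical, and the 60s maximum in the usage note is an assumed value (only the 3s initial interval comes from the description).

```java
import java.time.Duration;

/**
 * Hypothetical helper (not part of Flink): an exponential backoff for
 * pod creation retries, bounded by an initial and a max interval.
 */
final class PodCreationBackoff {
    private final Duration initial;
    private final Duration max;
    private Duration current;

    PodCreationBackoff(Duration initial, Duration max) {
        if (initial.isZero() || initial.isNegative() || max.compareTo(initial) < 0) {
            throw new IllegalArgumentException("require 0 < initial <= max");
        }
        this.initial = initial;
        this.max = max;
        this.current = initial;
    }

    /** Returns the delay to wait before the next retry, then doubles it up to max. */
    Duration nextDelay() {
        Duration delay = current;
        Duration doubled = current.multipliedBy(2);
        current = doubled.compareTo(max) > 0 ? max : doubled;
        return delay;
    }

    /** Resets the backoff, e.g. after a TaskManager pod was created successfully. */
    void reset() {
        current = initial;
    }
}
```

      With an initial interval of 3s and an assumed max of 60s, successive failures would be retried after 3s, 6s, 12s, 24s, 48s, 60s, 60s, and so on, until a successful pod creation calls reset().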

       

      Inspired by FLINK-17176: when a pod crashes unexpectedly, we should also apply the retry interval to avoid flooding the K8s API server with requests. But that could be done in a separate ticket/PR.

       

      [1] https://github.com/apache/flink/pull/11427#discussion_r406318451


            People

              Assignee: Unassigned
              Reporter: Yang Wang (wangyang0918)