Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-27675

Improve manual savepoint tracking

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      There are 2 problems with the manual savpeoint result observing logic that can cause the reconciler to not make progress with the deployment (recoveries, upgrades etc).

      1. Whenever the jobmanager deployment is not in READY state or the job itself is not RUNNING, the trigger info must be reset and we should not try to query it anymore. Flink will not retry the savepoint if the job fails, restarted anyways.
      2. If there is a sensible error when fetching the savepoint status (such as: 
        There is no savepoint operation with triggerId=xxx for job ) we should simply reset the trigger. These errors will never go away on their own and will simply cause the deployment to get stuck in observing/waiting for a savepoint to complete

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            morhidi Matyas Orhidi
            gyfora Gyula Fora
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment