[FLINK-27675] Improve manual savepoint tracking - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: kubernetes-operator-1.0.0
Component/s: Kubernetes Operator
Labels:
- pull-request-available

Description

There are 2 problems with the manual savepoint result observing logic that can prevent the reconciler from making progress with the deployment (recoveries, upgrades etc.).

1. Whenever the jobmanager deployment is not in READY state or the job itself is not RUNNING, the trigger info must be reset and we should not try to query it anymore. Flink will not retry the savepoint if the job fails or is restarted anyway.
2. If there is a sensible error when fetching the savepoint status (such as: There is no savepoint operation with triggerId=xxx for job ) we should simply reset the trigger. These errors will never go away on their own and will simply cause the deployment to get stuck in observing/waiting for a savepoint to complete.
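
The two rules above could be sketched roughly as follows. This is only an illustration of the intended behavior, not the operator's actual code: every class, enum and method name here (`SavepointObserverSketch`, `SavepointInfo`, `SavepointClient`, `FetchResult`, `observe`) is hypothetical and made up for this example.

```java
public class SavepointObserverSketch {

    // Hypothetical, simplified status enums (not the operator's real types).
    public enum JmStatus { READY, DEPLOYING, MISSING, ERROR }
    public enum JobState { RUNNING, RESTARTING, FAILED, FINISHED }
    public enum FetchStatus { PENDING, COMPLETED, ERROR }

    // Hypothetical result of asking the cluster about a triggered savepoint.
    public static final class FetchResult {
        public final FetchStatus status;
        public final String detail; // savepoint location or error message

        private FetchResult(FetchStatus status, String detail) {
            this.status = status;
            this.detail = detail;
        }

        public static FetchResult pending() { return new FetchResult(FetchStatus.PENDING, null); }
        public static FetchResult completed(String location) { return new FetchResult(FetchStatus.COMPLETED, location); }
        public static FetchResult error(String message) { return new FetchResult(FetchStatus.ERROR, message); }
    }

    // Hypothetical holder for the manual savepoint trigger info kept in the status.
    public static final class SavepointInfo {
        public String triggerId;             // null when no manual savepoint is pending
        public String lastSavepointLocation; // set when a savepoint completes

        public void resetTrigger() { triggerId = null; }
    }

    // Hypothetical client used to query the savepoint status.
    public interface SavepointClient {
        FetchResult fetch(String triggerId);
    }

    public static void observe(JmStatus jm, JobState job, SavepointInfo info, SavepointClient client) {
        if (info.triggerId == null) {
            return; // nothing pending, nothing to observe
        }
        // Problem 1: if the deployment is not READY or the job is not RUNNING,
        // drop the trigger and do not query the cluster at all.
        if (jm != JmStatus.READY || job != JobState.RUNNING) {
            info.resetTrigger();
            return;
        }
        FetchResult result = client.fetch(info.triggerId);
        switch (result.status) {
            case PENDING:
                break; // still in progress, check again on the next observe
            case COMPLETED:
                info.lastSavepointLocation = result.detail;
                info.resetTrigger();
                break;
            case ERROR:
                // Problem 2: errors such as an unknown triggerId never resolve
                // on their own, so reset instead of waiting forever.
                info.resetTrigger();
                break;
        }
    }
}
```

In this sketch a reset trigger simply means the next reconcile loop no longer waits on the savepoint; whether a new manual savepoint is triggered afterwards is left to the caller.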