[FLINK-19154] Application mode deletes HA data in case of suspended ZooKeeper connection - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.11.1, 1.12.0
Fix Version/s: 1.11.3, 1.12.0
Component/s: Client / Job Submission
Labels:
- pull-request-available
Environment:

A standalone cluster that runs a single job (if you are familiar with the way Ververica Platform runs Flink jobs, we use a very similar approach). It runs Flink 1.11.1 straight from the official Docker image.

Description

A user reported that Flink's application mode deletes HA data in case of a suspended ZooKeeper connection [1].

The problem seems to be that the ApplicationDispatcherBootstrap class produces an exception (the requested job can no longer be found because the ZooKeeper connection was lost), which is interpreted as a job failure. As a result, the cluster is shut down with the terminal state FAILED, which causes the HA data to be cleaned up. The exception originates in JobStatusPollingUtils.getJobResult, which is called by ApplicationDispatcherBootstrap.getJobResult().

The behaviour described above can be seen in this log [2].
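
To illustrate the distinction that seems to be missing, here is a minimal, self-contained sketch in plain Java. It does not use Flink's classes and is not the actual fix. The names JobResultClassifier, JobNotFoundException, JobExecutionException, Outcome and shouldCleanUpHaData are all hypothetical. The idea is that a failed lookup of the job result (for example, because the ZooKeeper connection is suspended) maps to an "unknown" outcome that must not trigger HA cleanup, while only a confirmed job failure or success counts as terminal.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Hypothetical sketch: none of these types exist in Flink under these names.
final class JobResultClassifier {

    /** Stand-in for "the requested job can no longer be found". */
    static final class JobNotFoundException extends RuntimeException {
        JobNotFoundException(String message) {
            super(message);
        }
    }

    /** Stand-in for "the job ran and really failed". */
    static final class JobExecutionException extends RuntimeException {
        JobExecutionException(String message) {
            super(message);
        }
    }

    enum Outcome { SUCCEEDED, FAILED, UNKNOWN }

    /** Classifies the result future without treating lookup errors as job failures. */
    static Outcome classify(CompletableFuture<String> jobResultFuture) {
        try {
            jobResultFuture.join();
            return Outcome.SUCCEEDED;
        } catch (CompletionException e) {
            // join() wraps the original exception; inspect the cause.
            if (e.getCause() instanceof JobExecutionException) {
                return Outcome.FAILED;
            }
            // Anything else (e.g. job not found) says nothing about the job itself.
            return Outcome.UNKNOWN;
        }
    }

    /** Assumption of this sketch: HA data is only removed for a confirmed terminal outcome. */
    static boolean shouldCleanUpHaData(Outcome outcome) {
        return outcome != Outcome.UNKNOWN;
    }
}
```

With this split, a future completed exceptionally with JobNotFoundException yields Outcome.UNKNOWN, so shouldCleanUpHaData returns false and the checkpoint metadata would survive a temporary connection loss.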

[1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoint-metadata-deleted-by-Flink-after-ZK-connection-issues-td37937.html
[2] https://pastebin.com/raw/uH9KDU2L