Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 1.13.2, 1.13.3
- Fix Version/s: None
- Labels: None
- Environment: Reproduced with:
  - Persistent jobs storage provided by the rocks-cephfs storage class.
  - OpenShift 4.9.5.
Description
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), enabling Flink HA with 3 JobManager replicas causes all replicas to go into CrashLoopBackOff.
Attaching the full logs of the jobmanager and tls-proxy containers of the jobmanager pod:
jm-flink-ha-jobmanager-log.txt
jm-flink-ha-tls-proxy-log.txt
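For context, a minimal flink-conf.yaml sketch of the kind of HA setup described above, assuming the Kubernetes HA services factory shipped with Flink 1.13; the cluster id and directory below are placeholders, not values taken from the actual deployment:

  # Enable the Kubernetes-based HA services (leader election via ConfigMaps)
  high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
  # All HA ConfigMaps created for this cluster are scoped by this id
  kubernetes.cluster-id: flink-ha-example
  # Shared directory for job graphs and checkpoint metadata, visible to every JobManager replica
  high-availability.storageDir: file:///flink-ha/recovery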
Reproduced with:
- Persistent jobs storage provided by the rocks-cephfs storage class (shared by all replicas via ReadWriteMany) and the mount path set via high-availability.storageDir: file:///<dir> (see the sketch after this list).
- OpenShift 4.9.5 and also 4.8.x; reproduced in several clusters, so it is not a one-off occurrence.
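As an illustration of that environment, a hypothetical sketch of how the shared ReadWriteMany volume and the three JobManager replicas might be wired in the Deployment; all names, the image tag and the mount path are assumptions for illustration only:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: flink-jobmanager
  spec:
    replicas: 3                          # three JobManager replicas for HA
    selector:
      matchLabels:
        app: flink-jobmanager
    template:
      metadata:
        labels:
          app: flink-jobmanager
      spec:
        containers:
          - name: jobmanager
            image: flink:1.13.2
            args: ["jobmanager"]
            volumeMounts:
              - name: flink-ha-storage
                mountPath: /flink-ha     # high-availability.storageDir (file:///flink-ha/recovery) lives on this mount
        volumes:
          - name: flink-ha-storage
            persistentVolumeClaim:
              claimName: flink-ha-pvc    # ReadWriteMany PVC on the rocks-cephfs storage class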
Remarks:
- This is a follow-up of https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
- Picked Critical severity as HA is critical for our product.
Attachments
- jm-flink-ha-jobmanager-log.txt
- jm-flink-ha-tls-proxy-log.txt
Issue Links
- is related to:
  - FLINK-28265 Inconsistency in Kubernetes HA service: broken state handle (Closed)
- relates to:
  - FLINK-24543 Zookeeper connection issue causes inconsistent state in Flink (Closed)
  - FLINK-22494 Avoid discarding checkpoints in case of failure (Closed)