Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 1.13.2, 1.13.3
- Fix Version/s: None
- Labels: None
- Environment: Reproduced with:
  - Persistent jobs storage provided by the rocks-cephfs storage class.
  - OpenShift 4.9.5.
Description
In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), enabling Flink HA with 3 JobManager replicas causes all replicas to go into CrashLoopBackOff.
Attaching the full logs of the jobmanager and tls-proxy containers of the jobmanager pod:
jm-flink-ha-jobmanager-log.txt
jm-flink-ha-tls-proxy-log.txt
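For context, a minimal flink-conf.yaml sketch of the kind of HA setup described above, assuming the Kubernetes HA services factory shipped with Flink 1.13; the cluster id and directory below are placeholders, not values taken from the actual deployment:

  # Enable the Kubernetes-based HA services (leader election via ConfigMaps)
  high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
  # All HA ConfigMaps created for this cluster are scoped by this id
  kubernetes.cluster-id: flink-ha-example
  # Shared directory for job graphs and checkpoint metadata, visible to every JobManager replica
  high-availability.storageDir: file:///flink-ha/recovery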
Reproduced with:
- Persistent jobs storage provided by the rocks-cephfs storage class (shared by all replicas via ReadWriteMany) and the mount path set via high-availability.storageDir: file:///<dir> (see the sketch after this list).
- OpenShift 4.9.5 and also 4.8.x; reproduced in several clusters, so it is not a one-off occurrence.
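As an illustration of that environment, a hypothetical sketch of how the shared ReadWriteMany volume and the three JobManager replicas might be wired in the Deployment; all names, the image tag and the mount path are assumptions for illustration only:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: flink-jobmanager
  spec:
    replicas: 3                          # three JobManager replicas for HA
    selector:
      matchLabels:
        app: flink-jobmanager
    template:
      metadata:
        labels:
          app: flink-jobmanager
      spec:
        containers:
          - name: jobmanager
            image: flink:1.13.2
            args: ["jobmanager"]
            volumeMounts:
              - name: flink-ha-storage
                mountPath: /flink-ha     # high-availability.storageDir (file:///flink-ha/recovery) lives on this mount
        volumes:
          - name: flink-ha-storage
            persistentVolumeClaim:
              claimName: flink-ha-pvc    # ReadWriteMany PVC on the rocks-cephfs storage class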
Remarks:
- This is a follow-up of https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
- Picked Critical severity as HA is critical for our product.
Attachments
- jm-flink-ha-jobmanager-log.txt
- jm-flink-ha-tls-proxy-log.txt
Issue Links
- is related to:
  - FLINK-28265 Inconsistency in Kubernetes HA service: broken state handle (Closed)
- relates to:
  - FLINK-24543 Zookeeper connection issue causes inconsistent state in Flink (Closed)
  - FLINK-22494 Avoid discarding checkpoints in case of failure (Closed)