FLINK-25098

Jobmanager CrashLoopBackOff in HA configuration

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.13.2, 1.13.3
    • None
    • None
    • Reproduced with:

      • Persistent jobs storage provided by the rocks-cephfs storage class.
      • OpenShift 4.9.5.

    Description

      In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), enabling Flink HA with 3 jobmanager replicas leads to CrashLoopBackOff for all replicas.

      Attached are the full logs of the jobmanager and tls-proxy containers of the jobmanager pod:
      jm-flink-ha-jobmanager-log.txt
      jm-flink-ha-tls-proxy-log.txt

      Reproduced with:

      • Persistent jobs storage provided by the rocks-cephfs storage class (shared by all replicas, ReadWriteMany), with the mount path set via high-availability.storageDir: file:///<dir>.
      • OpenShift 4.9.5 and also 4.8.x. Reproduced in several clusters, so this is not a one-off problem.
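
      For anyone trying to reproduce this, it may help to rule out a malformed high-availability.storageDir value first, since the Flink documentation examples give it as a URI with an explicit scheme (for example hdfs:/// or file:///). The following is only a minimal sketch: check_storage_dir is a hypothetical helper, not part of Flink, and the /flink/ha path is an assumed placeholder rather than the directory used in this deployment.

```python
from urllib.parse import urlparse


def check_storage_dir(value: str) -> str:
    """Return the URI scheme of a storageDir value, or raise ValueError.

    Hypothetical helper for illustration only; not part of Flink.
    """
    parsed = urlparse(value)
    # A value such as "file///dir" has no colon, so no scheme is detected.
    if not parsed.scheme:
        raise ValueError(f"no URI scheme in {value!r}; expected e.g. file:///<dir>")
    # For local or mounted storage, the path part should be absolute.
    if parsed.scheme == "file" and not parsed.path.startswith("/"):
        raise ValueError(f"file URI should carry an absolute path: {value!r}")
    return parsed.scheme


if __name__ == "__main__":
    # "/flink/ha" is an assumed mount path, used here only as an example.
    print(check_storage_dir("file:///flink/ha"))
```

      A check like this only catches formatting mistakes in the value; it says nothing about whether the mounted volume is actually writable by all replicas.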

      Remarks:

      Attachments

        1. flink_checkpoint_issue.txt
          57 kB
          MAU CHEE YEN
        2. iaf-insights-engine--7fc4-eve-29ee-ep-jobmanager-1-jobmanager.log
          67 kB
          Neeraj Laad
        3. JM-FlinkException-checkpointHA.txt
          5 kB
          Enrique Lacal
        4. jm-flink-ha-jobmanager-log.txt
          38 kB
          Adrian Vasiliu
        5. jm-flink-ha-tls-proxy-log.txt
          8 kB
          Adrian Vasiliu

            People

              Assignee: Unassigned
              Reporter: Adrian Vasiliu (adrianalexvasiliu)
              Votes: 0
              Watchers: 8