Flink / FLINK-28431

CompletedCheckpoints stored in ZooKeeper are not up to date; when the JobManager restarts, it fails to recover the job due to a "checkpoint FileNotFound exception"

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Invalid
    • 1.13.2
    • None
    • None
    • flink:1.13.2
      java:1.8

    Description

      We run many Flink clusters in native Kubernetes session mode on Flink 1.13.2. Some clusters run normally for 180 days, while others run for only about 30 days.
      The following uses one of the failing clusters, flink-k8s-session-opd-public-1132, as an example.

      Problem description:
      When the JobManager restarts, it fails with the following error:
      File does not exist: /home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a
      As a result, the entire Flink cluster cannot start. Because this is session mode, every other job on the cluster is also unable to start, so the impact is severe.

      Some auxiliary information:
      1. Flink cluster id: flink-k8s-session-opd-public-1132
      2. high-availability.storageDir in the cluster configuration: hdfs://neophdfsv2flink/home/flink/recovery/
      3. Failing job id: 18193cde2c359f492f76c8ce4cd20271
      4. There was a similar issue before, FLINK-8770, but it was closed without being resolved.
      5. I have uploaded the complete JobManager log as an attachment.

      My investigation:

      1. Inspect the ZooKeeper node for job id 18193cde2c359f492f76c8ce4cd20271:

      [zk: localhost:2181(CONNECTED) 17] ls /flink/flink/flink-k8s-session-opd-public-1132/checkpoints/18193cde2c359f492f76c8ce4cd20271
      [0000000000000025852, 0000000000000025851]

      [zk: localhost:2181(CONNECTED) 14] get /flink/flink/flink-k8s-session-opd-public-1132/checkpoints/18193cde2c359f492f76c8ce4cd20271/0000000000000025852

      ??sr;org.apache.flink.runtime.state.RetrievableStreamStateHandle?U?+LwrappedStreamStateHandlet2Lorg/apache/flink/runtime/state/StreamStateHandle;xpsr9org.apache.flink.runtime.state.filesystem.FileStateHandle?u?b?J stateSizefilePathtLorg/apache/flink/core/fs/Path;xp??srorg.apache.flink.core.fs.PathLuritLjava/net/URI;xpsr
      java.net.URI?x.C?I?LstringtLjava/lang/String;xptrhdfs://neophdfsv2flink/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4ax
      cZxid = 0x1070932e2
      ctime = Wed Jul 06 02:28:51 UTC 2022
      mZxid = 0x1070932e2
      mtime = Wed Jul 06 02:28:51 UTC 2022
      pZxid = 0x30001c957
      cversion=222
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x0
      dataLength = 545
      numChildren = 0.

      I am confident that the ZooKeeper ensemble itself is healthy: more than 10 Flink clusters share it, and only this cluster has the problem.
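
The `get` output above is a Java-serialized RetrievableStreamStateHandle, which is why the file path is hard to read in the terminal. As a rough sketch (this is not Flink's own deserialization), the path can be pulled out of the raw node bytes by scanning for Java serialization string records: the tag byte 0x74 followed by a two-byte big-endian length. The function name `find_uri_strings` is a hypothetical helper, and fetching the bytes from ZooKeeper is left to whatever client is available.

```python
import struct

TC_STRING = 0x74  # Java serialization tag that introduces a short UTF string


def find_uri_strings(payload):
    """Return every serialized string in `payload` that looks like a URI.

    Heuristic scan only: a 0x74 byte is treated as a string record when the
    two-byte length that follows fits inside the payload and the text
    contains "://".
    """
    found = []
    i = 0
    while i + 3 <= len(payload):
        if payload[i] == TC_STRING:
            (length,) = struct.unpack(">H", payload[i + 1:i + 3])
            chunk = payload[i + 3:i + 3 + length]
            if len(chunk) == length and b"://" in chunk:
                try:
                    found.append(chunk.decode("utf-8"))
                    i += 3 + length  # skip past the string we just consumed
                    continue
                except UnicodeDecodeError:
                    pass  # not real text, keep scanning byte by byte
        i += 1
    return found


# Illustrative payload only: a stream header, a class-name fragment, then the
# path from this report encoded as a string record and the end-of-block byte.
path = (b"hdfs://neophdfsv2flink/home/flink/recovery/flink/"
        b"flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a")
sample = (b"\xac\xed\x00\x05sr\x00\x0cjava.net.URI"
          + b"t" + struct.pack(">H", len(path)) + path + b"x")
print(find_uri_strings(sample))
```

Parsing the length prefix, rather than matching printable characters, avoids picking up the trailing `x` that follows the path in the dump; that byte is the serialization end-of-block marker, not part of the file name.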

      2. Check the HDFS audit log for the corresponding HDFS directory:
      ./hdfs-audit.log.1:2022-07-06 10:28:51,752 INFO FSNamesystem.audit: allowed=true ugi=flinkuser@HADOOP.163.GZ (auth:KERBEROS) ip=/10.91.136.213 cmd= create src=/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a dst=null perm=flinkuser:flinkuser:rw-r--r-- proto=rpc
      ./hdfs-audit.log.1:2022-07-06 10:29:26,588 INFO FSNamesystem.audit: allowed=true ugi=flinkuser@HADOOP.163.GZ (auth:KERBEROS) ip=/10.91.136.213 cmd= delete src=/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/completedCheckpoint86fce98d7e4a dst=null perm=null proto=rpc
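
To line these audit entries up with the ZooKeeper node's ctime, the two timestamp formats have to be put on the same clock. The sketch below does that under one loud assumption: the HDFS audit log is written in local time at UTC+8, which the report does not state. The helper name `seconds_between` and its `audit_utc_offset_hours` parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def seconds_between(zk_ctime, audit_ts, audit_utc_offset_hours=8):
    """Seconds from the ZooKeeper ctime to an HDFS audit timestamp.

    ASSUMPTION: the audit log uses local time at the given UTC offset
    (UTC+8 by default); adjust it if the NameNode logs in another zone.
    """
    zk_time = datetime.strptime(
        zk_ctime, "%a %b %d %H:%M:%S UTC %Y"
    ).replace(tzinfo=timezone.utc)
    audit_tz = timezone(timedelta(hours=audit_utc_offset_hours))
    audit_time = datetime.strptime(
        audit_ts, "%Y-%m-%d %H:%M:%S,%f"
    ).replace(tzinfo=audit_tz)
    return (audit_time - zk_time).total_seconds()


zk_ctime = "Wed Jul 06 02:28:51 UTC 2022"
create_offset = seconds_between(zk_ctime, "2022-07-06 10:28:51,752")
delete_offset = seconds_between(zk_ctime, "2022-07-06 10:29:26,588")
print(create_offset, delete_offset)
```

If the UTC+8 assumption holds, the create entry falls within the same second as the ZooKeeper ctime and the delete entry about 35 seconds after it, which would mean the ZooKeeper node still pointed at the file when it was removed.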

      I don't know why Flink created the file, deleted it about 35 seconds later, and did not update the metadata in ZooKeeper. As a result, the restarted JobManager cannot find the file that the ZooKeeper entry points to, and it keeps restarting.
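
The create-then-delete pattern can be looked for across a whole audit log rather than by eye. The following is a minimal sketch that assumes audit lines shaped like the two quoted above (timestamp, then `cmd=` and `src=` fields); the regular expression and the helper name `file_lifetimes` are illustrative, not part of any Hadoop tooling.

```python
import re
from datetime import datetime

# Matches the timestamp, command, and source path of one audit line.
AUDIT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*?"
    r"cmd=\s*(?P<cmd>\S+)\s+src=(?P<src>\S+)"
)


def file_lifetimes(lines):
    """Map each path that was created and later deleted to its lifetime in seconds."""
    created = {}
    lifetimes = {}
    for line in lines:
        match = AUDIT_RE.search(line)
        if match is None:
            continue  # not an audit entry in the expected shape
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        cmd, src = match.group("cmd"), match.group("src")
        if cmd == "create":
            created[src] = ts
        elif cmd == "delete" and src in created:
            lifetimes[src] = (ts - created.pop(src)).total_seconds()
    return lifetimes


CKPT = ("/home/flink/recovery/flink/flink-k8s-session-opd-public-1132/"
        "completedCheckpoint86fce98d7e4a")
audit_lines = [
    "./hdfs-audit.log.1:2022-07-06 10:28:51,752 INFO FSNamesystem.audit: "
    "allowed=true ugi=flinkuser@HADOOP.163.GZ (auth:KERBEROS) "
    "ip=/10.91.136.213 cmd= create src=" + CKPT + " dst=null proto=rpc",
    "./hdfs-audit.log.1:2022-07-06 10:29:26,588 INFO FSNamesystem.audit: "
    "allowed=true ugi=flinkuser@HADOOP.163.GZ (auth:KERBEROS) "
    "ip=/10.91.136.213 cmd= delete src=" + CKPT + " dst=null perm=null proto=rpc",
]
print(file_lifetimes(audit_lines))
```

Any path reported here that is still referenced by a checkpoint node in ZooKeeper is a candidate for the "File does not exist" failure on the next JobManager restart.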

      People

        Assignee: Unassigned
        Reporter: aresyhzhang