Flink / FLINK-19778

Failed job reinitiated with wrong checkpoint after a ZK reconnection

Details

    Description

      We have a Flink 1.11.0 job running on YARN that reached the FAILED state because its JobManager lost leadership during a ZooKeeper (ZK) full GC. After the ZK connection recovered, the job was somehow reinitiated, but no checkpoints were found in ZK, so an earlier savepoint was used to restore it. This unexpectedly rewound the job.
       
      For details please see the jobmanager logs in the attachment.

      Attachments

        1. jm_log
          46 kB
          Paul Lin

            People

              Assignee: Unassigned
              Reporter: Paul Lin
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved: