Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-16931

Large _metadata file lead to JobManager not responding when restart

    XMLWordPrintableJSON

Details

    Description

      When _metadata file is big, JobManager could never recover from checkpoint. It fall into a loop that fetch checkpoint -> JM timeout -> restart. Here is related log: 

       2020-04-01 17:08:25,689 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
       2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 3 checkpoints in ZooKeeper.
       2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 3 checkpoints from storage.
       2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 50.
       2020-04-01 17:08:48,589 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 51.
       2020-04-01 17:09:12,775 INFO org.apache.flink.yarn.YarnResourceManager - The heartbeat of JobManager with id 02500708baf0bb976891c391afd3d7d5 timed out.
      

      Digging into the code, looks like ExecutionGraph::restart runs in JobMaster main thread and finally calls ZooKeeperCompletedCheckpointStore::retrieveCompletedCheckpoint which download file form DFS. The main thread is basically blocked for a while because of this. One possible solution is to making the downloading part async. More things might need to consider as the original change tries to make it single-threaded. https://github.com/apache/flink/pull/7568

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              qqibrow Lu Niu
              Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated: