[FLINK-16931] Large _metadata file leads to JobManager not responding on restart - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Not a Priority
Resolution: Unresolved
Affects Version/s: 1.9.2, 1.10.0, 1.11.0, 1.12.0
Fix Version/s: None
Component/s: Runtime / Checkpointing, Runtime / Coordination
Labels:
- auto-unassigned
- stale-minor

Description

When the _metadata file is large, the JobManager can never recover from a checkpoint. It falls into a loop: fetch checkpoint -> JM timeout -> restart. Here is the related log:

 2020-04-01 17:08:25,689 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
 2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 3 checkpoints in ZooKeeper.
 2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 3 checkpoints from storage.
 2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 50.
 2020-04-01 17:08:48,589 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 51.
 2020-04-01 17:09:12,775 INFO org.apache.flink.yarn.YarnResourceManager - The heartbeat of JobManager with id 02500708baf0bb976891c391afd3d7d5 timed out.

Digging into the code, it looks like ExecutionGraph::restart runs in the JobMaster main thread and eventually calls ZooKeeperCompletedCheckpointStore::retrieveCompletedCheckpoint, which downloads the file from DFS. The main thread is therefore blocked for the duration of the download (about 23 seconds for checkpoint 50 in the log above). One possible solution is to make the download asynchronous. More things may need to be considered, because the original change deliberately made this path single-threaded: https://github.com/apache/flink/pull/7568
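
To illustrate the proposed direction, here is a minimal sketch of moving the blocking download onto a separate I/O executor and then handing the result back to a single "main thread" executor. It uses only plain JDK classes and is not Flink code: the class name, downloadMetadata, recoverAsync and both executors are hypothetical stand-ins, and the sleep merely simulates a slow DFS read.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

final class AsyncCheckpointRecoverySketch {

    /** Hypothetical stand-in for a blocking read of one _metadata file from DFS. */
    static String downloadMetadata(long checkpointId) {
        try {
            Thread.sleep(50); // simulate a slow DFS read
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "metadata-" + checkpointId;
    }

    /**
     * Runs the blocking downloads on ioExecutor, then continues on
     * mainThreadExecutor so that later state changes stay single-threaded.
     */
    static CompletableFuture<List<String>> recoverAsync(
            List<Long> checkpointIds, Executor ioExecutor, Executor mainThreadExecutor) {
        return CompletableFuture
                .supplyAsync(
                        () -> checkpointIds.stream()
                                .map(AsyncCheckpointRecoverySketch::downloadMetadata)
                                .collect(Collectors.toList()),
                        ioExecutor)
                // Hop back to the main-thread stand-in before touching shared state.
                .thenApplyAsync(metadata -> metadata, mainThreadExecutor);
    }

    public static void main(String[] args) {
        ExecutorService io = Executors.newSingleThreadExecutor();
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<List<String>> recovered =
                    recoverAsync(Arrays.asList(50L, 51L), io, mainThread);
            // While the download runs, the main-thread executor is free for
            // other work, such as answering heartbeats.
            System.out.println(recovered.join());
        } finally {
            io.shutdown();
            mainThread.shutdown();
        }
    }
}
```

The point of the second hop is to respect the concern raised above: only the slow I/O leaves the main thread, while whatever consumes the recovered checkpoints still runs on a single thread.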

Issue Links

is related to:
- FLINK-16770 Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file (Closed)
- FLINK-13698 Rework threading model of CheckpointCoordinator (Reopened)

People

Assignee: Unassigned
Reporter: Lu Niu
Votes: 0
Watchers: 12

Dates

Created: 02/Apr/20 00:10
Updated: 23/Nov/21 14:30