Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Not A Bug
- Affects Version/s: ml-2.3.0
- Fix Version/s: None
- Component/s: None
Description
When more than 1,000,000 records are cached using BroadcastUtils#withBroadcast, the job fails with the following exception:
Caused by: org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.checkFailureAgainstCounter(CheckpointFailureManager.java:206)
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleTaskLevelCheckpointException(CheckpointFailureManager.java:191)
    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:124)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2078)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveDeclineMessage(CheckpointCoordinator.java:1038)
    at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$declineCheckpoint$2(ExecutionGraphHandler.java:103)
    at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
The failure appears to come from the large number of cached records that AbstractBroadcastWrapperOperator#snapshot has to write out during a checkpoint.
The failing case can be found here: https://github.com/zhipeng93/flink-ml/tree/FLINK-31903-fail-case