[FLINK-34006] Flink terminates the execution of an application when there is a network problem between TaskManagers - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.17.1
Fix Version/s: None
Component/s: Runtime / Checkpointing, Runtime / Task
Labels:
None

Language:
- Java

Description

Flink terminates an application when two TaskManager are disconnected although there are enough resources in the cluster to run the application and we use checkpoint restart.

We deploy Flink v(1.17.1) on a cluster of six nodes with Ubuntu 18.04, the cluster consists of a JobManager and five TaskManagers. We use Flink's Standalone resource manager. We set the number of slots per TaskManager to one, and submit a WordCount application with a level of parallelism equal to three. We enable Flink checkpointing and restart failover strategy to attempt a restart in case of failure three times before termination and the time between attempts to 10 seconds.

The application starts running on the first 3 TaskManager.

If the communication is broken between two of the TaskManager that run the application, the job fails, and the JobManager tries to restart the job again. When the job fails the resources on the TaskManager are free. When the JobManager restarts the job, it selects the same three TaskManager it choose in the first attempt, and the job fails again. After three trials, Flink terminates the job with an exception: Connecting to remote task manager has failed.

These are the JobManager Configurations:

taskmanager.numberOfTaskSlots: 1
Enable checkpointing: --checkpointing
execution.checkpointing.interval: 3min
Enabling restart failover strategy
restart-strategy.type: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

command: ./bin/flink run -p 3 examples/streaming/WordCount.jar --checkpointing --input ~/flink/alice.txt

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Sophia

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 05/Jan/24 23:29

Updated:: 05/Jan/24 23:29