[FLINK-17933] TaskManager was terminated on Yarn - investigate - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 1.11.0
Fix Version/s: 1.11.0
Component/s: Deployment / YARN, Runtime / Task
Labels: None

Description

When running a job on a YARN cluster during load testing, some jobs fail.

The initial symptoms are zero bytes written/transferred in the CSV and task failures in the logs:

2020-05-17 10:02:32,858 WARN org.apache.flink.runtime.taskmanager.Task [] - Map -> Flat Map (138/160) (e49f7ea26b633c8035f2a919b1c580c8) switched from RUNNING to FAILED.

It turned out that all such failures were caused by "Connection reset" errors from a single IP, except for one "Leadership lost" error (from another IP).

The connection reset was likely caused by the TMs receiving SIGTERM (container_1589453804748_0118_01_000004 and container_1589453804748_0118_01_000005, both on ip-172-31-42-229):

2020-05-17 10:02:31,362 INFO org.apache.flink.yarn.YarnTaskExecutorRunner [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

The other TMs received SIGTERM one minute later, although all logs were uploaded at the same time.
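One way to check the suspected link between SIGTERM and the task failures is to line up timestamps from the TM logs. The sketch below is illustrative only: it assumes the log layout seen in the excerpts in this report and that the lines are already sorted by time. The names `parse_line` and `failures_after_sigterm` are hypothetical helpers, not part of Flink.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpts in this report:
# "<date> <time>,<ms> <LEVEL> <logger> [] - <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+\[\] - (?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, level, logger, message), or None if the line does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return ts, m.group("level"), m.group("logger"), m.group("msg")


def failures_after_sigterm(lines):
    """Return seconds elapsed from the first SIGTERM to each later task failure."""
    sigterm_at = None
    deltas = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # stack-trace lines and other non-matching lines are skipped
        ts, _level, _logger, msg = parsed
        if sigterm_at is None and "RECEIVED SIGNAL 15: SIGTERM" in msg:
            sigterm_at = ts
        elif sigterm_at is not None and "switched from RUNNING to FAILED" in msg:
            deltas.append((ts - sigterm_at).total_seconds())
    return deltas


# The two log lines quoted in this report, in time order.
sample = [
    "2020-05-17 10:02:31,362 INFO org.apache.flink.yarn.YarnTaskExecutorRunner [] - "
    "RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.",
    "2020-05-17 10:02:32,858 WARN org.apache.flink.runtime.taskmanager.Task [] - "
    "Map -> Flat Map (138/160) (e49f7ea26b633c8035f2a919b1c580c8) switched from RUNNING to FAILED.",
]
print(failures_after_sigterm(sample))
```

For the two lines quoted here, the gap is about 1.5 seconds. When the lines come from different hosts, clock skew between machines can distort such a comparison, so a small gap is a hint rather than proof.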

From the JM's side, it looked like this:

2020-05-17 10:02:23,583 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Trigger heartbeat request.
2020-05-17 10:02:23,587 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000005.
2020-05-17 10:02:23,590 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000006.
2020-05-17 10:02:23,592 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000004.
2020-05-17 10:02:23,595 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000003.
2020-05-17 10:02:23,598 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000002.
2020-05-17 10:02:23,725 DEBUG org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received acknowledge message for checkpoint 12 from task 459efd2ad8fe2ffe7fffe28530064fe1 of job 5d4d8c88de23b1361fe0dce6ba8443f8 at container_1589453804748_0118_01_000002 @ ip-172-31-43-69.eu-central-1.compute.internal (dataPort=44625).
2020-05-17 10:02:29,103 DEBUG org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received acknowledge message for checkpoint 12 from task 266a9326be7e3ec669cce2e6a97ae5b0 of job 5d4d8c88de23b1361fe0dce6ba8443f8 at container_1589453804748_0118_01_000005 @ ip-172-31-42-229.eu-central-1.compute.internal (dataPort=37329).
2020-05-17 10:02:32,862 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@ip-172-31-42-229.eu-central-1.compute.internal:39999] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2020-05-17 10:02:32,862 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@ip-172-31-42-229.eu-central-1.compute.internal:42567] has failed, address is now gated for [50] ms. Reason: [Disassociated]
2020-05-17 10:02:32,900 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Map -> Flat Map (87/160) (cb77c7002503baa74baf73a3a100c2f2) switched from RUNNING to FAILED.
org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection reset by peer (connection to 'ip-172-31-42-229.eu-central-1.compute.internal/172.31.42.229:37329')
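To confirm that the connection failures point at a single host, the JM log can be grouped by the remote address in each message. This is a rough sketch that assumes the two message shapes shown in the excerpt above; `hosts_by_failure` is a hypothetical helper name, not a Flink API.

```python
import re
from collections import Counter

# Message shapes assumed from the JM log excerpt in this report.
RESET_BY_PEER = re.compile(
    r"Connection reset by peer \(connection to '([^/']+)/([\d.]+):(\d+)'\)"
)
AKKA_FAILED = re.compile(
    r"Association with remote system \[akka\.tcp://flink@([^:\]]+):(\d+)\] has failed"
)


def hosts_by_failure(lines):
    """Count connection resets and failed Akka associations per remote host."""
    resets = Counter()
    disassociated = Counter()
    for line in lines:
        m = RESET_BY_PEER.search(line)
        if m:
            resets[m.group(1)] += 1
        m = AKKA_FAILED.search(line)
        if m:
            disassociated[m.group(1)] += 1
    return resets, disassociated


# Three lines copied from the JM log excerpt.
sample = [
    "2020-05-17 10:02:32,862 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system "
    "[akka.tcp://flink@ip-172-31-42-229.eu-central-1.compute.internal:39999] has failed, "
    "address is now gated for [50] ms. Reason: [Disassociated]",
    "2020-05-17 10:02:32,862 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system "
    "[akka.tcp://flink@ip-172-31-42-229.eu-central-1.compute.internal:42567] has failed, "
    "address is now gated for [50] ms. Reason: [Disassociated]",
    "org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: "
    "Connection reset by peer (connection to "
    "'ip-172-31-42-229.eu-central-1.compute.internal/172.31.42.229:37329')",
]

resets, disassociated = hosts_by_failure(sample)
for host, count in resets.items():
    print(host, "resets:", count, "failed associations:", disassociated[host])
```

In the excerpt, both failed associations and the connection reset name the same host, ip-172-31-42-229, on different ports, which is consistent with two TMs on that machine going down together.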

There are also JobManager heartbeat timeouts, but they don't correlate with the issue.

Issue Links

relates to: FLINK-17813 Manually test unaligned checkpoints on a cluster (Resolved)

People

Assignee: Roman Khachatryan

Reporter: Roman Khachatryan

Votes: 0

Watchers: 3

Dates

Created: 25/May/20 17:17

Updated: 27/Apr/21 18:04

Resolved: 02/Jun/20 09:08