FLINK-28695

Fail to send partition request to restarted taskmanager



    Description

      After upgrading to 1.15.1 we started getting the following error while running the job:

       

      org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Sending the partition request to '/XXX.XXX.XX.32:6121 (#0)' failed.
          at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient$1.operationComplete(NettyPartitionRequestClient.java:145)
          ....
      Caused by: org.apache.flink.shaded.netty4.io.netty.channel.StacklessClosedChannelException
          at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)

      After investigation we managed to narrow it down to the exact sequence of events under which this issue happens:

      1. Deploying the job on a fresh Kubernetes session cluster with multiple TaskManagers (TM1 and TM2) succeeds. The job has multiple partitions running on both TM1 and TM2.
      2. One TaskManager, TM2 (XXX.XXX.XX.32), fails for an unrelated reason, for example an OOM exception.
      3. The Kubernetes pod with TaskManager TM2 is restarted. The pod retains the same IP address as before.
      4. The JobManager is able to pick up the restarted TM2 (XXX.XXX.XX.32).
      5. The job is restarted because it was running on the failed TaskManager TM2.
      6. TM1's data channel to TM2 is closed and we get LocalTransportException: Sending the partition request to '/XXX.XXX.XX.32:6121 (#0)' failed while the job is running (see the sketch below).
      7. When we explicitly delete the pod with TM2, a new pod is created with a different IP address and the job is able to start again.
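
      The following is only a hypothetical sketch in plain Java (not Flink's actual NettyPartitionRequestClient code) of the pattern we suspect: a client that pools one TCP connection per remote IP:port will hand back a dead connection when the peer has restarted under the same address. All class and method names below are made up for illustration.

          import java.io.IOException;
          import java.net.InetSocketAddress;
          import java.net.Socket;
          import java.util.HashMap;
          import java.util.Map;

          // Hypothetical illustration, not Flink code: one cached TCP connection per
          // remote address, as with a single connection between a pair of TaskManagers.
          public class StaleConnectionSketch {

              private final Map<InetSocketAddress, Socket> pool = new HashMap<>();

              synchronized Socket getOrConnect(InetSocketAddress remote) throws IOException {
                  Socket cached = pool.get(remote);
                  if (cached != null) {
                      // The cache key (IP:port) is unchanged after the peer restarted,
                      // so the stale socket is handed out instead of a fresh one.
                      return cached;
                  }
                  Socket fresh = new Socket(remote.getAddress(), remote.getPort());
                  pool.put(remote, fresh);
                  return fresh;
              }

              synchronized void sendRequest(InetSocketAddress remote, byte[] payload) throws IOException {
                  Socket socket = getOrConnect(remote);
                  // Fails with a closed-channel / broken-pipe style error if the peer went
                  // away after the socket was pooled, even though the address is reachable.
                  socket.getOutputStream().write(payload);
                  socket.getOutputStream().flush();
              }
          }

      This matches what we observe: once TM2's pod comes back under the same IP (step 3), only a brand-new connection works, which is why deleting the pod and getting a new IP (step 7) unblocks the job.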

      It is important to note that we did not encounter this issue with the previous 1.14.4 version; TaskManager restarts did not cause such errors there.

      Please see the attached Kubernetes deployments and the reduced JobManager logs. The TaskManager logs did show errors before the failure, but do not show anything significant after the restart.

      EDIT:

      Setting taskmanager.network.max-num-tcp-connections to a very high number works around the problem.
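
      For reference, a minimal example of the workaround, assuming the option is set in the TaskManagers' flink-conf.yaml; the value 64 is only a placeholder for "a very high number":

          taskmanager.network.max-num-tcp-connections: 64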

      Attachments

        1. deployment.txt (7 kB, Simonas)
        2. image.png (157 kB, Vitor)
        3. image-1.png (95 kB, Vitor)
        4. image-2022-11-20-16-16-45-705.png (95 kB, Rui Fan)
        5. image-2022-11-21-17-15-58-749.png (157 kB, Rui Fan)
        6. job_log.txt (3 kB, Simonas)
        7. jobmanager_config.txt (2 kB, Simonas)
        8. jobmanager_logs.txt (0.5 kB, Simonas)
        9. pod_restart.txt (0.6 kB, Simonas)
        10. taskmanager_config.txt (1 kB, Simonas)


            People

              fanrui Rui Fan
              simonas.gelazevicius@vinted.com Simonas
