Description
I deployed a Flink cluster (version 1.16.2) and it ran normally for about 2 months, but recently I ran into a problem. Some subtasks showed high back pressure and the Flink job was completely blocked (see pic1.jpg). These subtasks were all on one task manager. After I stopped the abnormal task manager and redeployed the Flink job, the problem went away. I found the following error log on the abnormal task manager:
2024-03-03 15:57:25,088 ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue [] - Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection timed out
I checked the machine hosting the abnormal task manager. Its CPU, memory, and network looked the same as on the machines hosting the other task managers, so it doesn't look like a hardware problem.
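To rule out a plain connectivity problem between task managers, a check like the one below could be run from another task manager host. This is only a sketch: the function name `can_connect`, the host name, and the port in the usage comment are hypothetical placeholders. The real data port depends on `taskmanager.data.port` in the Flink configuration, which is picked by the OS unless it is set explicitly. A successful connect only shows that the port is reachable right now; it does not prove that long-lived connections stay healthy.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False


# Hypothetical usage (host and port are placeholders, not real values):
# print(can_connect("taskmanager-host-2", 6121))
```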
What does this error mean?
What should I do to solve this problem permanently?