Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 1.13.2
- Fix Version/s: None
Description
Logs and metrics show that a subtask deployed on the same TaskManager (TM) consistently fails with a timeout while requesting exclusive buffers. During the restart process, the Network memory metric of that TM remained unchanged, while heap memory usage did change. I suspect that network buffer memory was not properly released during the restart, so the newly deployed task could not obtain network buffers. The problem persisted across repeated restarts, and the application failed to recover automatically.
(I am not sure whether there are other causes for this issue.)
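The suspected mechanism can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyBufferPool` and `run_attempts` are hypothetical names, the pool fails immediately instead of blocking until a timeout, and none of this is Flink's actual NetworkBufferPool code. It shows why a fixed-size pool whose buffers are not returned on failover would stay at 100% usage and reject every later deployment.

```python
class ToyBufferPool:
    """Fixed-size pool. A toy stand-in, NOT Flink's NetworkBufferPool."""

    def __init__(self, total_buffers):
        self.total = total_buffers
        self.available = total_buffers

    def request(self, n):
        # The real pool blocks until a timeout expires; the toy fails at once.
        if n > self.available:
            raise IOError("Timeout triggered when requesting exclusive buffers")
        self.available -= n
        return n

    def release(self, n):
        self.available = min(self.total, self.available + n)

    def usage(self):
        # Fraction of the pool currently handed out (1.0 == 100%).
        return (self.total - self.available) / self.total


def run_attempts(pool, buffers_per_task, attempts, release_on_failover):
    """Deploy a task `attempts` times, tearing it down after each attempt."""
    outcomes = []
    for _ in range(attempts):
        try:
            held = pool.request(buffers_per_task)
            outcomes.append("DEPLOYED")
        except IOError:
            held = 0
            outcomes.append("FAILED")
        # Failover: the task goes away. A leaky teardown keeps its buffers.
        if release_on_failover and held:
            pool.release(held)
    return outcomes


leaky = ToyBufferPool(total_buffers=8)
print(run_attempts(leaky, 8, 3, release_on_failover=False), leaky.usage())

healthy = ToyBufferPool(total_buffers=8)
print(run_attempts(healthy, 8, 3, release_on_failover=True), healthy.usage())
```

In the leaky run only the first deployment succeeds and usage stays pinned at 1.0, which matches the flat Network metric and the repeated failures described above. In the healthy run every attempt deploys and usage returns to 0.0.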
Attached below are screenshots of the exception stack and relevant metrics:
2024-03-08 09:58:18,738 WARN org.apache.flink.runtime.taskmanager.Task [] - GroupWindowAggregate switched from DEPLOYING to FAILED with failure cause: java.io.IOException: Timeout triggered when requesting exclusive buffers: The total number of network buffers is currently set to 32768 of 32768 bytes each. You can increase this number by setting the configuration keys 'taskmanager.memory.network.fraction', 'taskmanager.memory.network.min', and 'taskmanager.memory.network.max', or you may increase the timeout which is 30000ms by setting the key 'taskmanager.network.memory.exclusive-buffers-request-timeout-ms'.
    at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.internalRequestMemorySegments(NetworkBufferPool.java:246)
    at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestPooledMemorySegmentsBlocking(NetworkBufferPool.java:169)
    at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.reserveSegments(LocalBufferPool.java:247)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setupChannels(SingleInputGate.java:427)
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:257)
    at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:84)
    at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:952)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:655)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
    at java.lang.Thread.run(Thread.java:748)
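To put the numbers in the exception message in context, the following sketch extracts the buffer count, buffer size, and timeout from the message and computes the total pool size. The helper name `parse_pool_settings` is hypothetical, and `MESSAGE` holds only two excerpts copied from the log above.

```python
import re

# Two excerpts copied from the exception message in the log above.
MESSAGE = (
    "The total number of network buffers is currently set to 32768 of "
    "32768 bytes each. "
    "or you may increase the timeout which is 30000ms by setting the key "
    "'taskmanager.network.memory.exclusive-buffers-request-timeout-ms'."
)


def parse_pool_settings(message):
    """Pull buffer count, buffer size, and timeout out of the message."""
    pool = re.search(r"set to (\d+) of (\d+) bytes each", message)
    timeout = re.search(r"timeout which is (\d+)ms", message)
    if pool is None or timeout is None:
        raise ValueError("message does not contain the expected fields")
    count, size = int(pool.group(1)), int(pool.group(2))
    return {
        "buffers": count,
        "buffer_bytes": size,
        "total_bytes": count * size,
        "timeout_ms": int(timeout.group(1)),
    }


settings = parse_pool_settings(MESSAGE)
print(settings["total_bytes"] / 2**30, "GiB")  # 32768 * 32768 bytes = 1.0 GiB
```

The pool on this TM is therefore 1 GiB (32768 buffers of 32 KiB each), and each request waits 30 seconds before failing. If the buffers are leaked rather than merely scarce, raising these settings would likely only delay the failure.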
Network metric: only this TM stays at 100%, without any variation.
Tasks deployed to this TM never reach RUNNING, and their status changes are slow.
Although the root exception reported by the application is PartitionNotFoundException, the underlying cause found in the logs is IOException: Timeout triggered when requesting exclusive buffers.
Attachments
Issue Links
- is fixed by
  FLINK-23724 Network buffer leak when ResultPartition is released (failover) (Closed)