Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- 1.16.0
Description
When buffer contention is fierce, a task can occasionally hang. The thread dump shows that the task thread is blocked in requestMemorySegmentBlocking forever. After investigating the heap dump, I found that the NetworkBufferPool actually still has many buffers, but the LocalBufferPool remains unavailable and never obtains one.
From the code, I am sure that this is a thread race bug: when the task thread polls the last buffer out of the LocalBufferPool and triggers the onGlobalPoolAvailable callback itself, the callback skips the notification because the LocalBufferPool is still marked available at that moment. The LocalBufferPool then becomes unavailable and never registers another callback with the NetworkBufferPool.
The conditions for triggering the problem are relatively strict, but I have found a stable way to reproduce it. I will try to fix and verify this problem.
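The sequence described above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions, not Flink's actual code: ToyGlobalPool, ToyLocalPool, and the skipWhenFlaggedAvailable switch are hypothetical names invented for illustration, and the non-skipping variant is just one way to show the difference, not the real patch. It relies on the fact that thenRun on an already completed CompletableFuture runs the action immediately in the calling thread, which is how the polling thread ends up running the callback itself.

```java
import java.util.ArrayDeque;
import java.util.concurrent.CompletableFuture;

public class BufferPoolRaceSketch {

    /** Hypothetical stand-in for the global pool; it starts with spare buffers. */
    static final class ToyGlobalPool {
        int spare = 4;

        CompletableFuture<Void> availableFuture() {
            // Already completed when spare buffers exist, so callbacks run inline.
            return spare > 0
                    ? CompletableFuture.completedFuture(null)
                    : new CompletableFuture<>();
        }
    }

    /** Hypothetical stand-in for the local pool; it starts with one segment. */
    static final class ToyLocalPool {
        final ToyGlobalPool global;
        final boolean skipWhenFlaggedAvailable;
        final ArrayDeque<Integer> segments = new ArrayDeque<>();
        boolean available = true;
        boolean callbackRegistered = false;
        int nextId = 1;

        ToyLocalPool(ToyGlobalPool global, boolean skipWhenFlaggedAvailable) {
            this.global = global;
            this.skipWhenFlaggedAvailable = skipWhenFlaggedAvailable;
            segments.add(0);
        }

        Integer poll() {
            Integer segment = segments.poll();
            if (segments.isEmpty()) {
                // The callback is registered while the pool is still flagged available.
                registerCallback();
                if (segments.isEmpty()) {
                    available = false;
                }
            }
            return segment;
        }

        void registerCallback() {
            callbackRegistered = true;
            global.availableFuture().thenRun(this::onGlobalPoolAvailable);
        }

        void onGlobalPoolAvailable() {
            callbackRegistered = false;
            if (skipWhenFlaggedAvailable && available) {
                return; // notification dropped: the flag is stale at this point
            }
            if (global.spare > 0) {
                global.spare--;
                segments.add(nextId++);
                available = true;
            }
        }

        /** Unavailable with no pending callback: nothing will ever refill this pool. */
        boolean isStuck() {
            return !available && !callbackRegistered;
        }
    }

    static boolean stuckAfterLastPoll(boolean skipWhenFlaggedAvailable) {
        ToyLocalPool pool = new ToyLocalPool(new ToyGlobalPool(), skipWhenFlaggedAvailable);
        pool.poll();
        return pool.isStuck();
    }

    public static void main(String[] args) {
        System.out.println("skipping callback stuck: " + stuckAfterLastPoll(true));
        System.out.println("non-skipping callback stuck: " + stuckAfterLastPoll(false));
    }
}
```

In this model the skipping variant ends up unavailable with no registered callback even though the global pool still has spare buffers, which matches the state seen in the heap dump.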
Attachments
Issue Links
- causes
  - FLINK-29419 HybridShuffleITCase.testHybridFullExchangesRestart hangs (Closed)
  - FLINK-29923 Hybrid Shuffle may face deadlock when running a task need to execute big size data (Closed)
- links to