Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- 1.17.0, 1.16.1, 1.18.0
Description
In our TPC-DS test, we found that under heavy contention for network memory, some tasks may hang forever.
The thread dump shows that the task is waiting for the LocalBufferPool to become available, even though other tasks have already finished and released their network memory. This is unexpected behavior and implies a bug in the LocalBufferPool or NetworkBufferPool.
A heap dump reveals something odd: the LocalBufferPool has available buffers, yet it is considered unavailable. Note also that it holds an overdraft buffer at that point.
TL;DR: The problem is a multi-threaded race condition related to the introduction of the overdraft buffer.
Suppose we have two threads, called A and B. For simplicity, LocalBufferPool is called LocalPool and NetworkBufferPool is called GlobalPool.
Thread A continuously requests buffers from LocalPool in blocking mode.
Thread B continuously returns buffers to GlobalPool.
- Thread A takes the last available buffer of LocalPool. GlobalPool has no buffer at this time, so LocalPool registers a callback with GlobalPool.
- Thread B returns one buffer to GlobalPool but has not yet triggered the callback.
- Thread A requests another buffer. Because the availableMemorySegments of LocalPool is empty, it requests an overdraft buffer instead. GlobalPool now holds a buffer, so the request succeeds.
- Thread B triggers the callback. Since GlobalPool has no buffer now, the callback is re-registered.
- Thread A requests another buffer. Because GlobalPool has no buffer, it blocks on CompletableFuture#get.
- Thread B returns another buffer and triggers the re-registered callback. As a result, LocalPool puts the buffer into availableMemorySegments. Unfortunately, the current logic of the shouldBeAvailable method is: if there is an overdraft buffer, LocalPool is considered unavailable.
- Thread A is never notified and keeps blocking forever.
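The interleaving above can be replayed deterministically in a small single-threaded model. This is only a sketch under stated assumptions, not Flink's actual code: the class `OverdraftRaceSketch` and its fields and methods (`globalFree`, `localAvailable`, `overdraft`, `availabilitySignalled`, `request`, `recycleToGlobal`, `fireCallback`, `replay`) are hypothetical names invented for illustration. Only `shouldBeAvailable` and the idea of `availableMemorySegments` come from the description, and the predicate is modeled exactly as described there.

```java
/** Simplified model of the race; all names except shouldBeAvailable are hypothetical. */
class OverdraftRaceSketch {
    int globalFree = 0;            // free buffers in GlobalPool
    boolean callbackRegistered;    // LocalPool asked GlobalPool to notify it
    int localAvailable = 1;        // models the size of availableMemorySegments
    int overdraft = 0;             // overdraft buffers currently held by LocalPool
    boolean availabilitySignalled; // true once a blocked thread A would be woken

    /** Availability check as described: any overdraft buffer means "unavailable". */
    boolean shouldBeAvailable() {
        return localAvailable > 0 && overdraft == 0;
    }

    /** Thread A: one request attempt; returns false when A would have to block. */
    boolean request() {
        if (localAvailable > 0) {
            localAvailable--;
            if (localAvailable == 0 && globalFree == 0) {
                callbackRegistered = true; // nothing to refill from, wait for GlobalPool
            }
            return true;
        }
        if (globalFree > 0) { // fall back to an overdraft buffer
            globalFree--;
            overdraft++;
            return true;
        }
        return false;
    }

    /** Thread B, first half: the buffer reaches GlobalPool. */
    void recycleToGlobal() {
        globalFree++;
    }

    /** Thread B, second half: the registered callback runs. */
    void fireCallback() {
        if (!callbackRegistered) {
            return;
        }
        if (globalFree == 0) {
            return; // buffer already taken: the callback stays (re-)registered
        }
        callbackRegistered = false;
        globalFree--;
        localAvailable++;
        if (shouldBeAvailable()) {
            availabilitySignalled = true; // would complete the future A waits on
        }
    }

    /** Replays the steps from the description in order. */
    static OverdraftRaceSketch replay() {
        OverdraftRaceSketch pool = new OverdraftRaceSketch();
        pool.request();         // A takes the last buffer, callback registered
        pool.recycleToGlobal(); // B returns a buffer, callback not yet triggered
        pool.request();         // A gets an overdraft buffer from GlobalPool
        pool.fireCallback();    // B triggers callback, GlobalPool empty, re-registered
        pool.request();         // A finds nothing and would block here
        pool.recycleToGlobal(); // B returns another buffer ...
        pool.fireCallback();    // ... and the callback moves it into LocalPool
        return pool;
    }

    public static void main(String[] args) {
        OverdraftRaceSketch pool = replay();
        System.out.println("available segments = " + pool.localAvailable
                + ", overdraft = " + pool.overdraft
                + ", A woken = " + pool.availabilitySignalled);
    }
}
```

At the end of the replay the model holds one available segment and one overdraft buffer, yet thread A has never been signalled, which matches the state seen in the heap dump.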
Attachments
Issue Links
- fixes: FLINK-31104 TPC-DS test timed out in query 36 (Open)
- is caused by: FLINK-26762 Add the overdraft buffer in BufferPool to reduce unaligned checkpoint being blocked (Closed)
- links to