[FLINK-12852] [hotfix timeout] Deadlock occurs when requiring exclusive buffer for RemoteInputChannel - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.7.2, 1.8.1, 1.9.0
Fix Version/s: 1.9.0
Component/s: Runtime / Network
Labels:
- pull-request-available

Release Note:

Hide
To avoid a potential deadlock, this adds a timeout (default value of 30 seconds, configurable via {{taskmanager.network.memory.exclusive-buffers-request-timeout-ms}}) for how long Task will be waiting for assignment of exclusive memory segments. However it is possible that for some previously working deployments this default timeout value is too low and might has to be increased.

Show
To avoid a potential deadlock, this adds a timeout (default value of 30 seconds, configurable via {{taskmanager.network.memory.exclusive-buffers-request-timeout-ms}}) for how long Task will be waiting for assignment of exclusive memory segments. However it is possible that for some previously working deployments this default timeout value is too low and might has to be increased.

Description

When running tests with an upstream vertex and downstream vertex, deadlock occurs when submitting the job:

"Sink: Unnamed (3/500)" #136 prio=5 os_prio=0 tid=0x00007f2cca81b000 nid=0x38845 waiting on condition [0x00007f2cbe9fe000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000073ed6b6f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:233)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:180)
at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.requestMemorySegments(NetworkBufferPool.java:54)
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.assignExclusiveSegments(RemoteInputChannel.java:139)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.assignExclusiveSegments(SingleInputGate.java:312)
- locked <0x000000073fbc81f0> (a java.lang.Object)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:220)
at org.apache.flink.runtime.taskmanager.Task.setupPartionsAndGates(Task.java:836)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:598)
at java.lang.Thread.run(Thread.java:834)

This is due to the required and max of local buffer pool is not the same and there may be over-allocation, when assignExclusiveSegments there are no available memory.

The detail of the scenarios is as follows: The parallelism of both upstream vertex and downstream vertex are 1500 and 500 respectively. There are 200 TM and each TM has 10696 buffers( in total and has 10 slots. For a TM that runs 9 upstream tasks and 1 downstream task, the 9 upstream tasks start first with local buffer pool {required = 500, max = 2 * 500 + 8 = 1008}, it produces data quickly and each occupy about 990 buffers. Then the DownStream task starts and try to assigning exclusive buffers for 1500 -9 = 1491 InputChannels. It requires 2981 buffers but only 1786 left. Since not all downstream tasks can start, the job will be blocked finally and no buffer can be released, and the deadlock finally occurred.

I think although increasing the network memory solves the problem, the deadlock may not be acceptable. Fined grained resource management Flink-12761 can solve this problem, but AFAIK in 1.9 it will not include the network memory into the ResourceProfile.

Attachments

Issue Links

is related to

FLINK-15031 Automatically calculate required network memory for fine-grained jobs

Closed

relates to

FLINK-13203 [proper fix] Deadlock occurs when requiring exclusive buffer for RemoteInputChannel

Open

FLINK-24035 Fix the deadlock issue caused by buffer listeners may not be notified

Closed

links to

GitHub Pull Request #8925

Activity

People

Assignee:: Yun Gao

Reporter:: Yun Gao

Votes:: 0 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 14/Jun/19 10:20

Updated:: 31/Aug/21 07:17

Resolved:: 11/Jul/19 07:26

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

20m