When a task fails or is canceled, SingleInputGate#releaseAllResources is invoked before the task exits.
During SingleInputGate#releaseAllResources, we first loop over all the input channels to release them, and then destroy the BufferPool. RemoteInputChannel#releaseAllResources returns the channel's floating buffers to the BufferPool, which assigns each recycled buffer to one of the other registered listeners (each a RemoteInputChannel).
This process can become recursive. If the listener that receives the buffer has already been released, it immediately recycles the buffer back to the BufferPool, which then picks another listener to notify of the available buffer. These steps can repeat, with each call nested inside the previous one.
If many input channels are registered as listeners in the BufferPool, the recursion can become deep enough to cause a StackOverflowError. In our test job, a scale of 10,000 input channels triggered this error.
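The recursion described above can be illustrated with a minimal sketch. It is a simplified model, not Flink's actual code: SketchPool, ReleasedChannel and maxRecursionDepth are hypothetical names, and the model assumes that every waiting listener has already been released, so each one hands the buffer straight back to the pool.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model; these are NOT Flink's actual classes.
final class RecursiveRecycleSketch {

    /** Stand-in for the BufferPool: keeps a queue of listeners waiting for a buffer. */
    static final class SketchPool {
        private final Queue<ReleasedChannel> listeners = new ArrayDeque<>();
        private int depth;   // current nesting of recycle() calls
        int maxDepth;        // deepest nesting observed so far

        void addListener(ReleasedChannel listener) {
            listeners.add(listener);
        }

        /** Offers the recycled buffer to the next waiting listener, if any. */
        void recycle() {
            depth++;
            maxDepth = Math.max(maxDepth, depth);
            ReleasedChannel next = listeners.poll();
            if (next != null) {
                // We are still inside recycle() when the listener is notified.
                next.notifyBufferAvailable(this);
            }
            depth--;
        }
    }

    /** Stand-in for an input channel that was released while still registered. */
    static final class ReleasedChannel {
        void notifyBufferAvailable(SketchPool pool) {
            // Already released: give the buffer straight back, re-entering recycle().
            pool.recycle();
        }
    }

    /** Recycles one buffer into a pool with the given number of released listeners. */
    static int maxRecursionDepth(int releasedListeners) {
        SketchPool pool = new SketchPool();
        for (int i = 0; i < releasedListeners; i++) {
            pool.addListener(new ReleasedChannel());
        }
        pool.recycle();
        return pool.maxDepth;
    }
}
```

In this model a single recycled buffer walks through every stale listener in one call chain, so the nesting depth grows linearly with the number of released listeners. With enough listeners, the chain exhausts the thread's stack.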
I can think of two ways to solve this potential problem:
- When an input channel is released, it should notify the BufferPool to unregister it as a listener; otherwise the two are left in an inconsistent state.
- SingleInputGate should destroy the BufferPool first, and then loop to release all of its input channels. This way, all the listeners in the BufferPool are removed when it is destroyed, and the input channels have no further interactions as listeners during RemoteInputChannel#releaseAllResources.
I prefer the second way, because we do not want to add another interface method for removing a buffer listener. Furthermore, the current internal data structure in BufferPool does not support removing a listener directly.
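A rough sketch of the second option follows. It compares the two release orders in a simplified model. All names here (SketchPool, SketchChannel, lazyDestroy, releaseGate) are hypothetical and chosen for illustration. The model assumes that destroying the pool clears its listener queue, and that only one channel holds a floating buffer while the others are merely waiting as listeners.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical, simplified model of the two release orders; NOT Flink's actual classes.
final class DestroyFirstSketch {

    static final class SketchPool {
        private final Queue<SketchChannel> listeners = new ArrayDeque<>();
        private boolean destroyed;
        private int depth;   // current nesting of recycle() calls
        int maxDepth;        // deepest nesting observed so far

        void addListener(SketchChannel channel) {
            if (!destroyed) {
                listeners.add(channel);
            }
        }

        /** Assumed behavior: destroying the pool drops all registered listeners. */
        void lazyDestroy() {
            destroyed = true;
            listeners.clear();
        }

        void recycle() {
            depth++;
            maxDepth = Math.max(maxDepth, depth);
            SketchChannel next = destroyed ? null : listeners.poll();
            if (next != null) {
                next.notifyBufferAvailable(this);
            }
            depth--;
        }
    }

    static final class SketchChannel {
        private final int floatingBuffers;
        private boolean released;

        SketchChannel(int floatingBuffers) {
            this.floatingBuffers = floatingBuffers;
        }

        void releaseAllResources(SketchPool pool) {
            released = true;
            for (int i = 0; i < floatingBuffers; i++) {
                pool.recycle(); // return each floating buffer to the pool
            }
        }

        void notifyBufferAvailable(SketchPool pool) {
            if (released) {
                pool.recycle(); // already released: hand the buffer back
            }
            // A live channel would keep the buffer; omitted in this sketch.
        }
    }

    /** Releases a gate and returns the deepest recycle() nesting that occurred. */
    static int releaseGate(int waitingChannels, boolean destroyPoolFirst) {
        SketchPool pool = new SketchPool();
        List<SketchChannel> channels = new ArrayList<>();
        for (int i = 0; i < waitingChannels; i++) {
            SketchChannel waiting = new SketchChannel(0);
            pool.addListener(waiting);
            channels.add(waiting);
        }
        channels.add(new SketchChannel(1)); // the only channel holding a buffer

        if (destroyPoolFirst) {
            pool.lazyDestroy();
        }
        for (SketchChannel channel : channels) {
            channel.releaseAllResources(pool);
        }
        if (!destroyPoolFirst) {
            pool.lazyDestroy();
        }
        return pool.maxDepth;
    }
}
```

Under these assumptions, releasing the channels first lets the nesting depth grow with the number of waiting channels, while destroying the pool first keeps it at a single recycle() call regardless of how many channels exist. The sketch needs no listener-removal method on the pool, which matches the motivation for preferring this option.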