When executing shuffle tasks, the shuffle service establishes TCP connections (on port 7337 by default).
They look like this:
However, some of the TCP connections remain busy after the task has actually finished. These connections are not closed automatically until we restart the NodeManager process.
Connections pile up and the NodeManagers get slower and slower.
These unclosed TCP connections stay busy, and setting ChannelOption.SO_KEEPALIVE to true, as done in SPARK-23182, does not seem to take effect.
So the solution is to set ChannelOption.AUTO_CLOSE to true. After this change, our cluster (running 10000+ jobs / day) is processing normally.
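
For illustration, the option change could be sketched as follows with the Netty 4.1 API. This is a minimal sketch, not the actual patch: the class name ShuffleServerOptions and the method applyChildOptions are hypothetical, and where the options are applied inside Spark's transport server is an assumption. It needs Netty on the classpath.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;

// Hypothetical helper; the name and placement are assumptions for illustration only.
public final class ShuffleServerOptions {
    private ShuffleServerOptions() {}

    // Applies the options discussed above to every accepted (child) channel.
    public static ServerBootstrap applyChildOptions(ServerBootstrap bootstrap) {
        return bootstrap
            // Keep-alive probing, the option referred to via SPARK-23182.
            .childOption(ChannelOption.SO_KEEPALIVE, true)
            // Ask Netty to close the channel automatically on write failure.
            .childOption(ChannelOption.AUTO_CLOSE, true);
    }
}
```

The helper only records the options on the bootstrap. They take effect on connections accepted after the server is bound, so connections that are already open are not affected.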