There is a socket descriptor leak in a pyspark streaming app when it is configured with a batch interval of more than 30 seconds. It is caused by the default timeout in the py4j JavaGateway, which half-closes the CallbackConnection after 30 seconds of inactivity and creates a new one the next time a callback is needed. The half-closed connection never gets closed on the python CallbackServer side, so connections keep piling up until they eventually block new ones.
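The leak pattern can be illustrated generically (a minimal sketch, not py4j itself): a server whose accept loop never closes connections that the peer has half-closed will accumulate one open descriptor per reconnect cycle, which is what happens to the CallbackServer here.

```python
import socket
import threading
import time

# Minimal illustration (NOT py4j itself) of the leak pattern: a server
# that keeps every accepted connection open, even after the peer has
# half-closed its side, accumulates socket descriptors indefinitely.

leaked = []  # accepted connections the "callback server" never closes

def accept_loop(server):
    while True:
        try:
            conn, _ = server.accept()
        except OSError:
            break  # server socket closed, stop the loop
        leaked.append(conn)  # never conn.close() -> descriptor leak

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(64)
port = server.getsockname()[1]
threading.Thread(target=accept_loop, args=(server,), daemon=True).start()

# Simulate the gateway timing out, half-closing, and reconnecting:
# each cycle leaves one more open descriptor on the server side.
for _ in range(5):
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.connect(("127.0.0.1", port))
    c.shutdown(socket.SHUT_WR)  # half-close, like the idle timeout does
    c.close()

time.sleep(0.2)  # let the accept loop drain the backlog
print(len(leaked))  # one leaked connection per reconnect cycle
server.close()
```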
- Submit attached bug.py to spark
- Watch /tmp/bug.log to see the total number of py4j callback connections increase; none of them are ever closed
- You can confirm this by running lsof against the pyspark driver process:
- If you leave it running long enough, the CallbackServer eventually becomes unable to accept new connections from the gateway and the app crashes:
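As an alternative to lsof for watching the leak grow, open socket descriptors can be counted directly via /proc (Linux only; `count_socket_fds` is a hypothetical helper, not part of the attached bug.py):

```python
import os
import socket

def count_socket_fds(pid="self"):
    """Count open socket descriptors of a process via /proc (Linux only).

    `pid` can be a numeric PID (e.g. of the pyspark driver) or "self"
    for the current process.
    """
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # fd vanished between listdir and readlink
        if target.startswith("socket:"):
            count += 1
    return count

# Quick self-check: opening a socket raises the count by one.
before = count_socket_fds()
s = socket.socket()
delta = count_socket_fds() - before
s.close()
print(delta)  # 1
```

Polling this once per batch interval and logging the result gives the same increasing trend that /tmp/bug.log shows for the callback connections.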