Description
The SocketServer runs separate threads for Acceptors and Processors. An Acceptor hands each accepted connection to a Processor via Processor#accept, which adds it to the `newConnections` queue; the Processor later picks it up in Processor#configureNewConnections.
During shutdown, the Acceptor and its Processors are each stopped by setting shouldRun to false, after which all of them shut down concurrently with no ordering between them. This leads to a race condition: an Acceptor can accept a SocketChannel and queue it to a Processor that has already started shutting down and has already drained its `newConnections` queue.
KAFKA-16765 is an analogous bug in NioEchoServer, which uses a completely different implementation but has the same flaw.
An example execution order that produces this leak:
1. Acceptor#accept() is called, and a new SocketChannel is accepted.
2. Acceptor#assignNewConnection() begins, but has not yet called Processor#accept().
3. Acceptor#close() is called, which sets shouldRun to false in the Acceptor and its attached Processor instances.
4. Processor#run() checks the shouldRun variable and exits the loop.
5. Processor#closeAll() executes and drains the `newConnections` queue.
6. Processor#run() returns, and the Processor thread terminates.
7. Acceptor#assignNewConnection() calls Processor#accept(), which adds the SocketChannel to `newConnections`.
8. Acceptor#assignNewConnection() returns.
9. Acceptor#run() checks the shouldRun variable and exits the loop, and the Acceptor thread terminates.
10. Acceptor#close() joins all of the terminated threads and returns.
At the end of this sequence, `newConnections` still holds open SocketChannel instances. No thread remains to drain the queue, so these channels are never closed and are leaked.
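The interleaving can be replayed deterministically on a single thread. The sketch below is a simplified model, not Kafka's code: `FakeChannel`, `SketchProcessor`, `AcceptRaceSketch`, and the `guarded` flag are all hypothetical names introduced for illustration. With `guarded` set to false it reproduces steps 5 and 7 in the problematic order. With `guarded` set to true it shows one possible way to close the window, assuming accept() and closeAll() share a lock and the Acceptor closes any channel that is refused; this is not necessarily the fix adopted for this issue.

```java
import java.nio.channels.Channel;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Stand-in for SocketChannel so the sketch needs no network access (hypothetical). */
final class FakeChannel implements Channel {
    private volatile boolean open = true;
    @Override public boolean isOpen() { return open; }
    @Override public void close() { open = false; }
}

/** Heavily simplified model of a Processor (hypothetical, not Kafka's class). */
final class SketchProcessor {
    private final BlockingQueue<FakeChannel> newConnections = new ArrayBlockingQueue<>(16);
    private final boolean guarded;
    private boolean closed = false;

    SketchProcessor(boolean guarded) { this.guarded = guarded; }

    /** Called by the acceptor thread. Returns false if the channel was not queued. */
    synchronized boolean accept(FakeChannel channel) {
        if (guarded && closed) {
            return false; // refuse hand-off after closeAll(); the caller keeps ownership
        }
        return newConnections.offer(channel);
    }

    /** Called by the processor thread on its way out of run(). */
    synchronized void closeAll() {
        closed = true;
        FakeChannel channel;
        while ((channel = newConnections.poll()) != null) {
            channel.close();
        }
    }
}

final class AcceptRaceSketch {
    /** Replays the hand-off after the drain; returns true if the channel is left open. */
    static boolean replayLeaks(boolean guarded) {
        SketchProcessor processor = new SketchProcessor(guarded);
        FakeChannel accepted = new FakeChannel();     // step 1: connection accepted
        processor.closeAll();                         // step 5: queue drained while empty
        boolean queued = processor.accept(accepted);  // step 7: late hand-off
        if (!queued) {
            accepted.close();                         // acceptor closes what it could not hand off
        }
        return accepted.isOpen();                     // still open means leaked
    }
}
```

Calling `AcceptRaceSketch.replayLeaks(false)` leaves the channel open in the queue of a processor that will never run again, while `replayLeaks(true)` closes it on the acceptor side. The same effect could presumably also be achieved by ordering the shutdown so that the Acceptor thread is joined before any Processor drains its queue.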
Issue Links
- Discovered while testing: KAFKA-15845 Add Junit5 test extension which detects leaked Kafka clients and servers (In Progress)
- is related to: KAFKA-16765 NioEchoServer leaks accepted SocketChannel instances due to race condition (Resolved)