This is very easy to reproduce on stock Solr with SSL enabled. My test setup creates two SSL-enabled Solr instances hosting a 5 shard x 2 replica collection and runs a short indexing program (just 9 update requests with 1 document each and a commit at the end). Running the indexing program repeatedly causes the number of connections in the CLOSE_WAIT state to increase gradually.
Interestingly, the number of connections stuck in CLOSE_WAIT decreases during indexing and increases again about 10 or so seconds after the indexing is stopped.
I can reproduce the problem on 6.1, 6.0, 5.5.1, and 5.3.2. I am not able to reproduce it on master, although I don't see anything relevant that has changed since 6.1; I tried only once, so it may have just been bad timing.
When the connections show up in the CLOSE_WAIT state, the recv-q buffer always holds exactly 70 bytes:
netstat -tonp | grep CLOSE_WAIT | grep java
tcp 70 0 127.0.0.1:56538 127.0.1.1:8983 CLOSE_WAIT 21654/java off (0.00/0/0)
tcp 70 0 127.0.0.1:47995 127.0.1.1:8984 CLOSE_WAIT 21654/java off (0.00/0/0)
tcp 70 0 127.0.0.1:47477 127.0.1.1:8984 CLOSE_WAIT 21654/java off (0.00/0/0)
tcp 70 0 127.0.0.1:47996 127.0.1.1:8984 CLOSE_WAIT 21654/java off (0.00/0/0)
tcp 70 0 127.0.0.1:56644 127.0.1.1:8983 CLOSE_WAIT 21654/java off (0.00/0/0)
tcp 70 0 127.0.0.1:56533 127.0.1.1:8983 CLOSE_WAIT 21654/java off (0.00/0/0)
If I run the same steps with SSL disabled, the connections in the CLOSE_WAIT state have just 1 byte in recv-q. I don't see the number of such connections increasing with indexing over time, but I know for a fact (from a client) that eventually more and more connections pile up in this state even without SSL:
tcp 1 0 127.0.0.1:41723 127.0.1.1:8983 CLOSE_WAIT 2522/java off (0.00/0/0)
tcp 1 0 127.0.0.1:41780 127.0.1.1:8983 CLOSE_WAIT 2640/java off (0.00/0/0)
I enabled debug logging for PoolingHttpClientConnectionManager (used in 6.x) and PoolingClientConnectionManager (used in 5.x). After running the indexing program and verifying that some connections were in CLOSE_WAIT, I grepped the logs for connections leased vs. released. The two counts always match, which means the connections are always given back to the pool.
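For reference, this is roughly how I compared the counts. The "Connection leased" / "Connection released" prefixes are what those connection managers emit at DEBUG level as far as I can tell; the sample log file here is synthetic, just to illustrate the check (the real input would be Solr's log file):

```shell
# Synthetic sample of PoolingHttpClientConnectionManager DEBUG output.
# In practice, point the greps at the actual Solr log file instead.
cat > solr-sample.log <<'EOF'
DEBUG PoolingHttpClientConnectionManager Connection leased: [id: 0]
DEBUG PoolingHttpClientConnectionManager Connection released: [id: 0]
DEBUG PoolingHttpClientConnectionManager Connection leased: [id: 1]
DEBUG PoolingHttpClientConnectionManager Connection released: [id: 1]
EOF

# If every leased connection is given back to the pool, the counts match.
leased=$(grep -c 'Connection leased' solr-sample.log)
released=$(grep -c 'Connection released' solr-sample.log)
echo "leased=$leased released=$released"
```

A mismatch here would have pointed at a connection leak in the client code; matching counts are what shifts suspicion to idle, already-returned connections instead.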
Now, some connections hanging around in CLOSE_WAIT are to be expected because of the following (quoted from the HttpClient documentation):
One of the major shortcomings of the classic blocking I/O model is that the network socket can react to I/O events only when blocked in an I/O operation. When a connection is released back to the manager, it can be kept alive however it is unable to monitor the status of the socket and react to any I/O events. If the connection gets closed on the server side, the client side connection is unable to detect the change in the connection state (and react appropriately by closing the socket on its end).
HttpClient tries to mitigate the problem by testing whether the connection is 'stale', that is no longer valid because it was closed on the server side, prior to using the connection for executing an HTTP request. The stale connection check is not 100% reliable. The only feasible solution that does not involve a one thread per socket model for idle connections is a dedicated monitor thread used to evict connections that are considered expired due to a long period of inactivity. The monitor thread can periodically call ClientConnectionManager#closeExpiredConnections() method to close all expired connections and evict closed connections from the pool. It can also optionally call ClientConnectionManager#closeIdleConnections() method to close all connections that have been idle over a given period of time.
I'm going to try adding such a monitor thread and see if this is still a problem.
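The kind of monitor thread the HttpClient documentation describes would look roughly like the sketch below. ConnectionEvictor here is a hypothetical stand-in for the two eviction methods on httpclient's connection manager (closeExpiredConnections / closeIdleConnections are the real method names quoted above); the 5-second period and 30-second idle timeout are arbitrary values for illustration:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the eviction methods that httpclient's
// connection managers expose (the real type would be the pooling
// connection manager used by Solr's HttpClient instances).
interface ConnectionEvictor {
    void closeExpiredConnections();
    void closeIdleConnections(long idleTime, TimeUnit unit);
}

// Dedicated monitor thread, as suggested by the HttpClient docs:
// periodically evict expired and long-idle connections from the pool
// so they don't linger in CLOSE_WAIT.
class IdleConnectionMonitorThread extends Thread {
    private final ConnectionEvictor evictor;
    private final long periodMillis;
    private volatile boolean shutdown;

    IdleConnectionMonitorThread(ConnectionEvictor evictor, long periodMillis) {
        super("idle-connection-monitor");
        setDaemon(true);
        this.evictor = evictor;
        this.periodMillis = periodMillis;
    }

    @Override
    public void run() {
        try {
            while (!shutdown) {
                synchronized (this) {
                    wait(periodMillis);
                }
                if (shutdown) {
                    break;
                }
                // Close connections whose keep-alive has expired ...
                evictor.closeExpiredConnections();
                // ... and connections idle longer than 30s (arbitrary cutoff).
                evictor.closeIdleConnections(30, TimeUnit.SECONDS);
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
    }

    void shutdownMonitor() {
        shutdown = true;
        synchronized (this) {
            notifyAll();
        }
    }
}
```

With the real httpclient types, the monitor would be constructed around the shared connection manager and started once when the client is initialized, then shut down with the client.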