Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
0.17
None
None
Description
On gw56.iu.xsede.org, where the develop branch of airavata is deployed, there are currently over 4,000 Zookeeper connections in TIME_WAIT state.
[airavata@gw56 ~]$ netstat -anp --tcp | grep 2181 | grep TIME_WAIT | wc -l
(Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.)
4758
This number has stayed fairly constant for as long as I've been watching it. On gw77.iu.xsede.org, where the master branch is deployed, there are none of these TIME_WAIT connections.
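To compare the two hosts on more than the TIME_WAIT count, it may help to tally every TCP state for the Zookeeper port in one pass. The following is a minimal sketch, not something from the original investigation: the function name `count_states` and the sample rows are hypothetical, and it assumes the usual Linux `netstat -an --tcp` column layout (protocol, queues, local address, foreign address, state). Unlike `grep 2181`, it only matches an address whose port is exactly 2181.

```python
from collections import Counter


def count_states(netstat_output: str, port: int = 2181) -> Counter:
    """Tally TCP states for connections whose local or foreign port is `port`."""
    counts = Counter()
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected row: Proto Recv-Q Send-Q Local Foreign State [PID/Program]
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue  # skips the header, the root warning, and blank lines
        local, foreign, state = fields[3], fields[4], fields[5]
        if local.endswith(suffix) or foreign.endswith(suffix):
            counts[state] += 1
    return counts


# Hypothetical sample rows, made up for illustration only.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:2181          127.0.0.1:51234         TIME_WAIT   -
tcp        0      0 127.0.0.1:2181          127.0.0.1:51235         TIME_WAIT   -
tcp        0      0 127.0.0.1:51300         127.0.0.1:2181          ESTABLISHED 4321/java
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -
"""

if __name__ == "__main__":
    for state, n in count_states(SAMPLE).most_common():
        print(state, n)
```

On a real host the input could come from `subprocess.run(["netstat", "-an", "--tcp"], capture_output=True, text=True).stdout`; running it periodically on gw56 and gw77 would show whether ESTABLISHED counts differ as well.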
I looked into this a bit and wrote the following on HipChat:
[5:41 PM] Marcus Christie: From what I've been reading, I think the TIME_WAIT problem must be coming from Zookeeper clients connecting and then closing over and over again.
[5:42 PM] Marcus Christie: A TCP connection will stay in TIME_WAIT for about 4 minutes after it is closed http://stackoverflow.com/questions/10726049/what-is-the-reason-for-time-wait-connection-increasing-i...
[5:44 PM] Marcus Christie: There are consistently about 4,000 connections in TIME_WAIT. If they hang around for 4 minutes (240 seconds), then that means there must be 16.667 new connections being created (and eventually closed) each second.
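The rate estimate in the chat is a steady-state argument: if the number of sockets in TIME_WAIT is roughly constant, connections must be closing at about (count) / (time spent in TIME_WAIT) per second. Here is a small sketch of that arithmetic. The function name is hypothetical, and the 240-second figure is the assumption made in the chat; Linux kernels commonly hold TIME_WAIT for 60 seconds instead, in which case the implied rate would be about four times higher.

```python
def steady_state_rate(connections_in_time_wait: int, time_wait_seconds: float) -> float:
    """Connections closed per second needed to sustain a constant TIME_WAIT count."""
    if time_wait_seconds <= 0:
        raise ValueError("time_wait_seconds must be positive")
    return connections_in_time_wait / time_wait_seconds


if __name__ == "__main__":
    # Figure used in the chat: about 4,000 sockets, 240 s in TIME_WAIT.
    print(round(steady_state_rate(4000, 240), 3))  # 16.667
    # Count actually measured on gw56, same 240 s assumption.
    print(round(steady_state_rate(4758, 240), 3))  # 19.825
    # Same count under an assumed 60 s TIME_WAIT.
    print(round(steady_state_rate(4758, 60), 1))   # 79.3
```

Either way, the conclusion is the same: something is opening and closing Zookeeper connections many times per second.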
Other things:
- smarru already tried purging old logs, see the Zookeeper docs
- Zookeeper has some administrative commands that are useful for finding out its self-reported statistics about number of connections, etc.
- to run these, do
telnet localhost 2181
stat
- useful links on TIME_WAIT
- http://serverfault.com/questions/329845/how-to-forcibly-close-a-socket-in-time-wait
- http://stackoverflow.com/questions/10726049/what-is-the-reason-for-time-wait-connection-increasing-in-java
- http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html
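The telnet step above can also be scripted, which makes it easier to sample Zookeeper's self-reported statistics repeatedly. This is a rough sketch only: `four_letter_word` and `parse_stat` are hypothetical helper names, and the parser assumes the reply consists mostly of `Key: value` lines (for example a `Connections:` line), which should be checked against the output of the deployed Zookeeper version.

```python
import socket


def four_letter_word(command: str, host: str = "localhost",
                     port: int = 2181, timeout: float = 5.0) -> str:
    """Send an administrative command such as 'stat' and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_stat(reply: str) -> dict:
    """Collect top-level 'Key: value' lines; indented client rows are skipped."""
    stats = {}
    for line in reply.splitlines():
        if line.startswith(" ") or ": " not in line:
            continue
        key, value = line.split(": ", 1)
        stats[key.strip()] = value.strip()
    return stats


if __name__ == "__main__":
    # Requires a Zookeeper server listening on localhost:2181.
    reply = four_letter_word("stat")
    print(parse_stat(reply).get("Connections", "no Connections line in reply"))
```

Note that each call opens and closes a connection of its own, so polling this too often would itself add to the TIME_WAIT count being investigated.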