Solaris 10 x86, jdk 1.7. We use tomcat clustering for session replication with 4 nodes and sometimes 8 nodes. We get a ConcurrentModificationException occasionally on shutdown. I have been unable to reliably reproduce the exception. In the log, I see "Unable to close cluster receiver selector." with the exception below: java.util.ConcurrentModificationException java.util.HashMap$HashIterator.nextEntry(HashMap.java:894) java.util.HashMap$KeyIterator.next(HashMap.java:928) java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1067) org.apache.catalina.tribes.transport.nio.NioReceiver.closeSelector(NioReceiver.java:382) org.apache.catalina.tribes.transport.nio.NioReceiver.stopListening(NioReceiver.java:365) org.apache.catalina.tribes.transport.nio.NioReceiver.stop(NioReceiver.java:86) org.apache.catalina.tribes.group.ChannelCoordinator.internalStop(ChannelCoordinator.java:203) org.apache.catalina.tribes.group.ChannelCoordinator.stop(ChannelCoordinator.java:115) org.apache.catalina.tribes.group.ChannelInterceptorBase.stop(ChannelInterceptorBase.java:178) org.apache.catalina.tribes.group.ChannelInterceptorBase.stop(ChannelInterceptorBase.java:178) org.apache.catalina.tribes.group.ChannelInterceptorBase.stop(ChannelInterceptorBase.java:178) org.apache.catalina.tribes.group.interceptors.MessageDispatchInterceptor.stop(MessageDispatchInterceptor.java:172) org.apache.catalina.tribes.group.ChannelInterceptorBase.stop(ChannelInterceptorBase.java:178) org.apache.catalina.tribes.group.ChannelInterceptorBase.stop(ChannelInterceptorBase.java:178) org.apache.catalina.tribes.group.GroupChannel.stop(GroupChannel.java:438) org.apache.catalina.ha.tcp.SimpleTcpCluster.stopInternal(SimpleTcpCluster.java:744) org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:232) org.apache.catalina.core.ContainerBase.stopInternal(ContainerBase.java:1199) org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:232) org.apache.catalina.core.StandardService.stopInternal(StandardService.java:502) org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:232) org.apache.catalina.core.StandardServer.stopInternal(StandardServer.java:753) org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:232) org.apache.catalina.startup.Catalina.stop(Catalina.java:751) org.apache.catalina.startup.Catalina.start(Catalina.java:713) sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.lang.reflect.Method.invoke(Method.java:601) org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:322) org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:451) I looked at the code briefly and noticed the use of a SelectionKey Iterator. I have not dug deep enough to find any issue in the NioReceiver code, but I did find this potentially relevant text in the Selector javadocs (http://docs.oracle.com/javase/7/docs/api/java/nio/channels/Selector.html): "A selector's key and selected-key sets are not, in general, safe for use by multiple concurrent threads. If such a thread might modify one of these sets directly then access should be controlled by synchronizing on the set itself. The iterators returned by these sets' iterator methods are fail-fast: If the set is modified after the iterator is created, in any way except by invoking the iterator's own remove method, then a ConcurrentModificationException will be thrown."
Thanks for the report. I did see a code path that would result in two threads both trying to close the Selector. This, and some other threading issues, have been fixed in trunk and 7.0.x and will be included in 7.0.33 onwards. I can't be 100% certain that the problem I found is the one you were seeing but it looks very much like it.
I verified that this problem still exists in 7.0.41. I tracked down and fixed the problem. Problem: During shutdown, NioReceiver.close() is called which ultimately calls NioReceiver.closeSelector(). selector.keys() is called and an iteration is performed - closing the existing channels and canceling the keys. Per the Selector javadocs, the set cannot be modified during iteration by another thread or else a ConcurrentModificationException will be thrown. The set is in fact being modified by another thread (NioReceiver thread) when existing connections are closed (see listen() and canceledKey(SelectionKey) methods). The problem is difficult to reproduce because of timing. I was able to reliably reproduce the problem but I had to add a sleep in the closeSelector() method and or set breakpoints during testing. Ultimately, I couldn't come up with a test case that didn't need a sleep in closeSelector() so I don't have a test case for the attached patch. The attached patch loops (up to 3 times) catching ConcurrentModificationException during close and sleeps for a brief period (100ms * attempt number) allowing the NioReceiver thread to finish it's loop. After the sleep, it attempts to close channels again.
Created attachment 30489 [details] fix for bug
Created attachment 30490 [details] fix for bug i inadvertently left in a printStackTrace on the original patch. use this one instead.
(In reply to Casey Lucas from comment #2) > Problem: > During shutdown, NioReceiver.close() There is no such method. I'm going to assume you mean stop().
Thanks for the analysis and patch. I went with a different approach as I generally prefer to tackle the root cause rather than handle the symptoms of a problem. Providing that the receiver thread stops in a timely manner, the patch prevents multiple threads accessing the key set.
(In reply to Mark Thomas from comment #5) > (In reply to Casey Lucas from comment #2) > > Problem: > > During shutdown, NioReceiver.close() > > There is no such method. I'm going to assume you mean stop(). Yes, stop. Thanks for the fix.