Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-6579

SnapPuller Replication blocks clean shutdown of tomcat

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 4.10.1
    • 4.10.4, 5.0, 6.0
    • replication (java)
    • None

    Description

      main issue was described in the mailing list here:
      http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201409.mbox/browser
      and
      here:

      but also including the quotes:
      original message from Nick

      Hello,
      I have solr 4.10 running on tomcat 7. I'm doing replication from one master
      to about 10 slaves, with standard configuration:

            <requestHandler name="/replication" class="solr.ReplicationHandler" >
                 <lst name="master">
                   <str name="enable">${enable.master:false}</str>
                   <str name="replicateAfter">commit</str>
                   <str name="replicateAfter">startup</str>
                   <str name="confFiles">schema.xml,stopwords.txt</str>
                   <!-- str name="commitReserveDuration">00:00:10</str -->
                 </lst>
                 <lst name="slave">
                   <str name="enable">${enable.slave:false}</str>
                   <str name="masterUrl">http://master:8080/solr/mycore</str>
                   <str name="pollInterval">00:00:60</str>
                 </lst>
            </requestHandler>
      

      It appears that if tomcat gets shutdown while solr is replicating, it
      prevents tomcat from shutting down fully. Immediately after receiving the
      shutdown command, a thread dump is logged into catalina.out (this may have
      been turned on by some configuration someone else on my team made). I
      removed some threads that didn't look related, mostly about tomcat session
      replication, or with names like "http-bio-8080-exec-10".

          62252 [http-bio-8080-exec-1] INFO  org.apache.solr.core.SolrCore  –
      [mycore] webapp=/solr path=/replication
      params={command=details&_=1412014928648&wt=json} status=0 QTime=6
          63310 [http-bio-8080-exec-1] INFO  org.apache.solr.core.SolrCore  –
      [mycore] webapp=/solr path=/replication
      params={command=details&_=1412014929699&wt=json} status=0 QTime=6
          2014-09-29 14:22:10
          Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed
      mode):
      
          "fsyncService-12-thread-1" prio=10 tid=0x00007f3bd4002000 nid=0x203d
      waiting on condition [0x00007f3c271f0000]
             java.lang.Thread.State: WAITING (parking)
                  at sun.misc.Unsafe.park(Native Method)
                  - parking to wait for  <0x00000007e1ff4458> (a
      java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
                  at
      java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
                  at
      java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
                  at
      java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
                  at
      java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
                  at
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
                  at
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:745)
      
          "explicit-fetchindex-cmd" daemon prio=10 tid=0x00007f3c0413e800
      nid=0x203c runnable [0x00007f3c272f1000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.read(SocketInputStream.java:152)
                  at java.net.SocketInputStream.read(SocketInputStream.java:122)
                  at
      org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
                  at
      org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:174)
                  at
      org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
                  at
      org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
                  at
      org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:114)
                  at
      org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:152)
                  at
      org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchPackets(SnapPuller.java:1239)
                  at
      org.apache.solr.handler.SnapPuller$DirectoryFileFetcher.fetchFile(SnapPuller.java:1187)
                  at
      org.apache.solr.handler.SnapPuller.downloadIndexFiles(SnapPuller.java:774)
                  at
      org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:424)
                  at
      org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:337)
                  at
      org.apache.solr.handler.ReplicationHandler$1.run(ReplicationHandler.java:227)
      
          "process reaper" daemon prio=10 tid=0x00007f3c0409c000 nid=0x203b
      waiting on condition [0x00007f3c984e9000]
             java.lang.Thread.State: TIMED_WAITING (parking)
                  at sun.misc.Unsafe.park(Native Method)
                  - parking to wait for  <0x00000007dfbfd890> (a
      java.util.concurrent.SynchronousQueue$TransferStack)
                  at
      java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
                  at
      java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
                  at
      java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:359)
                  at
      java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:942)
                  at
      java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
                  at
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
                  at
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:745)
      
          "process reaper" daemon prio=10 tid=0x00007f3c0409a000 nid=0x2039
      waiting on condition [0x00007f3cac040000]
             java.lang.Thread.State: TIMED_WAITING (parking)
                  at sun.misc.Unsafe.park(Native Method)
                  - parking to wait for  <0x00000007dfbfd890> (a
      java.util.concurrent.SynchronousQueue$TransferStack)
                  at
      java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
                  at
      java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
                  at
      java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:359)
                  at
      java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:942)
                  at
      java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
                  at
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
                  at
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:745)
      
          "ajp-bio-8009-AsyncTimeout" daemon prio=10 tid=0x00007f3ce013c000
      nid=0x202e waiting on condition [0x00007f3c27af9000]
             java.lang.Thread.State: TIMED_WAITING (sleeping)
                  at java.lang.Thread.sleep(Native Method)
                  at
      org.apache.tomcat.util.net.JIoEndpoint$AsyncTimeout.run(JIoEndpoint.java:148)
                  at java.lang.Thread.run(Thread.java:745)
      
          "ajp-bio-8009-Acceptor-0" daemon prio=10 tid=0x00007f3ce013a800
      nid=0x202d runnable [0x00007f3c985ea000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.PlainSocketImpl.socketAccept(Native Method)
                  at
      java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
                  at java.net.ServerSocket.implAccept(ServerSocket.java:530)
                  at java.net.ServerSocket.accept(ServerSocket.java:498)
                  at
      org.apache.tomcat.util.net.DefaultServerSocketFactory.acceptSocket(DefaultServerSocketFactory.java:60)
                  at
      org.apache.tomcat.util.net.JIoEndpoint$Acceptor.run(JIoEndpoint.java:216)
                  at java.lang.Thread.run(Thread.java:745)
      
          "http-bio-8080-exec-1" daemon prio=10 tid=0x00007f3c0c001800 nid=0x202c
      runnable [0x00007f3c986eb000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.SocketInputStream.socketRead0(Native Method)
                  at java.net.SocketInputStream.read(SocketInputStream.java:152)
                  at java.net.SocketInputStream.read(SocketInputStream.java:122)
                  at
      org.apache.coyote.http11.InternalInputBuffer.fill(InternalInputBuffer.java:516)
                  at
      org.apache.coyote.http11.InternalInputBuffer.fill(InternalInputBuffer.java:501)
                  at
      org.apache.coyote.http11.Http11Processor.setRequestLineReadTimeout(Http11Processor.java:173)
                  at
      org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:927)
                  at
      org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
                  at
      org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
                  - locked <0x00000007df0874b0> (a
      org.apache.tomcat.util.net.SocketWrapper)
                  at
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                  at
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:745)
      
          "http-bio-8080-AsyncTimeout" daemon prio=10 tid=0x00007f3ce0139000
      nid=0x202b waiting on condition [0x00007f3cac1aa000]
             java.lang.Thread.State: TIMED_WAITING (sleeping)
                  at java.lang.Thread.sleep(Native Method)
                  at
      org.apache.tomcat.util.net.JIoEndpoint$AsyncTimeout.run(JIoEndpoint.java:148)
                  at java.lang.Thread.run(Thread.java:745)
      
          "http-bio-8080-Acceptor-0" daemon prio=10 tid=0x00007f3ce0138000
      nid=0x202a runnable [0x00007f3cac2ab000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.PlainSocketImpl.socketAccept(Native Method)
                  at
      java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
                  at java.net.ServerSocket.implAccept(ServerSocket.java:530)
                  at java.net.ServerSocket.accept(ServerSocket.java:498)
                  at
      org.apache.tomcat.util.net.DefaultServerSocketFactory.acceptSocket(DefaultServerSocketFactory.java:60)
                  at
      org.apache.tomcat.util.net.JIoEndpoint$Acceptor.run(JIoEndpoint.java:216)
                  at java.lang.Thread.run(Thread.java:745)
      
          "ContainerBackgroundProcessor[StandardEngine[Catalina]]" daemon prio=10
      tid=0x00007f3ce05b4000 nid=0x2029 waiting on condition [0x00007f3cacbb7000]
             java.lang.Thread.State: TIMED_WAITING (sleeping)
                  at java.lang.Thread.sleep(Native Method)
                  at
      org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1508)
                  at java.lang.Thread.run(Thread.java:745)
      
          "clock:10000" daemon prio=10 tid=0x00007f3c5cafc000 nid=0x2028 in
      Object.wait() [0x00007f3c27bfa000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  - waiting on <0x00000007dc6cf7e8> (a
      org.webmacro.util.Clock$ClockThread)
                  at org.webmacro.util.Clock$ClockThread.run(Clock.java:108)
                  - locked <0x00000007dc6cf7e8> (a
      org.webmacro.util.Clock$ClockThread)
      
          "TimeLoop(1000,600)" daemon prio=10 tid=0x00007f3c5cac6000 nid=0x2027
      waiting on condition [0x00007f3c27cfb000]
             java.lang.Thread.State: TIMED_WAITING (sleeping)
                  at java.lang.Thread.sleep(Native Method)
                  at
      org.webmacro.util.TimeLoop$TimeLoopThread.run(TimeLoop.java:237)
      
          "pool-1-thread-2" daemon prio=10 tid=0x00007f3c5d1b3800 nid=0x2022
      waiting on condition [0x00007f3c98246000]
             java.lang.Thread.State: WAITING (parking)
                  at sun.misc.Unsafe.park(Native Method)
                  - parking to wait for  <0x000000070b90b4a0> (a
      java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
                  at
      java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
                  at
      java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
                  at
      java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
                  at
      java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
                  at
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
                  at
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:745)
      
          "Java2D Disposer" daemon prio=10 tid=0x00007f3c5cfeb800 nid=0x2021 in
      Object.wait() [0x00007f3c9835d000]
             java.lang.Thread.State: WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  - waiting on <0x00000007e6030140> (a
      java.lang.ref.ReferenceQueue$Lock)
                  at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
                  - locked <0x00000007e6030140> (a
      java.lang.ref.ReferenceQueue$Lock)
                  at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
                  at sun.java2d.Disposer.run(Disposer.java:145)
                  at java.lang.Thread.run(Thread.java:745)
      
          "MySQL Statement Cancellation Timer" daemon prio=10
      tid=0x00007f3c5ca0c000 nid=0x2020 in Object.wait() [0x00007f3c9845e000]
             java.lang.Thread.State: WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  - waiting on <0x00000007e5f45ad8> (a java.util.TaskQueue)
                  at java.lang.Object.wait(Object.java:503)
                  at java.util.TimerThread.mainLoop(Timer.java:526)
                  - locked <0x00000007e5f45ad8> (a java.util.TaskQueue)
                  at java.util.TimerThread.run(Timer.java:505)
      
          "Timer-0" daemon prio=10 tid=0x00007f3c5c9f7800 nid=0x201f in
      Object.wait() [0x00007f3cad0bc000]
             java.lang.Thread.State: WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  - waiting on <0x00000007e605a8d8> (a java.util.TaskQueue)
                  at java.lang.Object.wait(Object.java:503)
                  at java.util.TimerThread.mainLoop(Timer.java:526)
                  - locked <0x00000007e605a8d8> (a java.util.TaskQueue)
                  at java.util.TimerThread.run(Timer.java:505)
      
          "Thread-5" daemon prio=10 tid=0x00007f3c5c00f800 nid=0x201e in
      Object.wait() [0x00007f3cac3ac000]
             java.lang.Thread.State: WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  - waiting on <0x000000070b92b230> (a java.lang.Object)
                  at java.lang.Object.wait(Object.java:503)
                  at org.apache.solr.core.CloserThread.run(CoreContainer.java:905)
                  - locked <0x000000070b92b230> (a java.lang.Object)
      
          "snapPuller-10-thread-1" prio=10 tid=0x00007f3c6c26e800 nid=0x201d
      waiting on condition [0x00007f3cac555000]
             java.lang.Thread.State: TIMED_WAITING (parking)
                  at sun.misc.Unsafe.park(Native Method)
                  - parking to wait for  <0x000000070b92b2a8> (a
      java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
                  at
      java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
                  at
      java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
                  at
      java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
                  at
      java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
                  at
      java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
                  at
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
                  at
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:745)
      
          "searcherExecutor-6-thread-1" prio=10 tid=0x00007f3c6c1ea000 nid=0x201c
      waiting on condition [0x00007f3cac810000]
             java.lang.Thread.State: WAITING (parking)
                  at sun.misc.Unsafe.park(Native Method)
                  - parking to wait for  <0x000000070b92b440> (a
      java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
                  at
      java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
                  at
      java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
                  at
      java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
                  at
      java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
                  at
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
                  at
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:745)
      
          "GC Daemon" daemon prio=10 tid=0x00007f3ce0536000 nid=0x200d in
      Object.wait() [0x00007f3cd8137000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  - waiting on <0x000000070bbee5b0> (a sun.misc.GC$LatencyLock)
                  at sun.misc.GC$Daemon.run(GC.java:117)
                  - locked <0x000000070bbee5b0> (a sun.misc.GC$LatencyLock)
      
          "Service Thread" daemon prio=10 tid=0x00007f3ce00a6000 nid=0x200b
      runnable [0x0000000000000000]
             java.lang.Thread.State: RUNNABLE
      
          "C2 CompilerThread1" daemon prio=10 tid=0x00007f3ce00a4000 nid=0x200a
      waiting on condition [0x0000000000000000]
             java.lang.Thread.State: RUNNABLE
      
          "C2 CompilerThread0" daemon prio=10 tid=0x00007f3ce00a1000 nid=0x2009
      waiting on condition [0x0000000000000000]
             java.lang.Thread.State: RUNNABLE
      
          "Signal Dispatcher" daemon prio=10 tid=0x00007f3ce009f800 nid=0x2008
      waiting on condition [0x0000000000000000]
             java.lang.Thread.State: RUNNABLE
      
          "Finalizer" daemon prio=10 tid=0x00007f3ce0077800 nid=0x2007 in
      Object.wait() [0x00007f3cd8c1b000]
             java.lang.Thread.State: WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  - waiting on <0x000000070bbee828> (a
      java.lang.ref.ReferenceQueue$Lock)
                  at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
                  - locked <0x000000070bbee828> (a
      java.lang.ref.ReferenceQueue$Lock)
                  at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
                  at
      java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
      
          "Reference Handler" daemon prio=10 tid=0x00007f3ce0075800 nid=0x2006 in
      Object.wait() [0x00007f3cadffe000]
             java.lang.Thread.State: WAITING (on object monitor)
                  at java.lang.Object.wait(Native Method)
                  - waiting on <0x000000070b8b23b8> (a
      java.lang.ref.Reference$Lock)
                  at java.lang.Object.wait(Object.java:503)
                  at
      java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
                  - locked <0x000000070b8b23b8> (a java.lang.ref.Reference$Lock)
      
          "main" prio=10 tid=0x00007f3ce0009000 nid=0x1ffc runnable
      [0x00007f3ce675c000]
             java.lang.Thread.State: RUNNABLE
                  at java.net.PlainSocketImpl.socketAccept(Native Method)
                  at
      java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
                  at java.net.ServerSocket.implAccept(ServerSocket.java:530)
                  at java.net.ServerSocket.accept(ServerSocket.java:498)
                  at
      org.apache.catalina.core.StandardServer.await(StandardServer.java:452)
                  at org.apache.catalina.startup.Catalina.await(Catalina.java:766)
                  at org.apache.catalina.startup.Catalina.start(Catalina.java:712)
                  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                  at
      sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
                  at
      sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                  at java.lang.reflect.Method.invoke(Method.java:606)
                  at
      org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:322)
                  at
      org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:456)
      
          JNI global references: 422
      
          Heap
           PSYoungGen      total 1184256K, used 1031985K [0x00000007aaa80000,
      0x0000000800000000, 0x0000000800000000)
            eden space 970240K, 94% used
      [0x00000007aaa80000,0x00000007e259e740,0x00000007e5e00000)
            from space 214016K, 55% used
      [0x00000007e5e00000,0x00000007ed2adfd0,0x00000007f2f00000)
            to   space 212480K, 0% used
      [0x00000007f3080000,0x00000007f3080000,0x0000000800000000)
           ParOldGen       total 1398272K, used 209681K [0x0000000700000000,
      0x0000000755580000, 0x00000007aaa80000)
            object space 1398272K, 14% used
      [0x0000000700000000,0x000000070ccc4700,0x0000000755580000)
           PSPermGen       total 78848K, used 78659K [0x00000006c0000000,
      0x00000006c4d00000, 0x0000000700000000)
            object space 78848K, 99% used
      [0x00000006c0000000,0x00000006c4cd0ca8,0x00000006c4d00000)
      
      right after this, here comes the shutdown:
      
          Sep 29, 2014 2:22:10 PM org.apache.catalina.core.StandardServer await
          INFO: A valid shutdown command was received via the shutdown port.
      Stopping the Server instance.
          Sep 29, 2014 2:22:10 PM org.apache.coyote.AbstractProtocol pause
          INFO: Pausing ProtocolHandler ["http-bio-8080"]
          Sep 29, 2014 2:22:10 PM org.apache.coyote.AbstractProtocol pause
          INFO: Pausing ProtocolHandler ["ajp-bio-8009"]
          Sep 29, 2014 2:22:11 PM org.apache.catalina.core.StandardService
      stopInternal
          INFO: Stopping service Catalina
          Sep 29, 2014 2:22:11 PM
      org.apache.catalina.ha.session.JvmRouteBinderValve stopInternal
          INFO: JvmRouteBinderValve stopped
          Sep 29, 2014 2:22:11 PM org.apache.catalina.ha.session.DeltaManager
      stopInternal
          INFO: Manager [localhost#/dy] expiring sessions upon shutdown
      
      (snip some stuff about an unrelated webapp, then the solr errors)
      
          64876 [localhost-startStop-2] INFO  org.apache.solr.core.CoreContainer
       – Shutting down CoreContainer instance=1359815787
          64877 [localhost-startStop-2] INFO  org.apache.solr.core.SolrCore  –
      [mycore]  CLOSING SolrCore org.apache.solr.core.SolrCore@1ade5d50
          64882 [localhost-startStop-2] INFO
       org.apache.solr.update.UpdateHandler  – closing
      DirectUpdateHandler2{commits=0,autocommit
      maxTime=15000ms,autocommits=0,soft
      autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
          64883 [localhost-startStop-2] INFO
       org.apache.solr.update.SolrCoreState  – Closing SolrCoreState
          64883 [localhost-startStop-2] INFO
       org.apache.solr.update.DefaultSolrCoreState  – SolrCoreState ref count has
      reached 0 - closing IndexWriter
          64883 [localhost-startStop-2] INFO
       org.apache.solr.update.DefaultSolrCoreState  – closing IndexWriter with
      IndexWriterCloser
          64885 [localhost-startStop-2] ERROR
      org.apache.solr.update.DefaultSolrCoreState  – Error during shutdown of
      writer.
          org.apache.lucene.store.AlreadyClosedException: this IndexWriter is
      closed
                  at
      org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:700)
                  at
      org.apache.lucene.index.IndexWriter.waitForMerges(IndexWriter.java:2384)
                  at
      org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:795)
                  at
      org.apache.solr.update.DefaultSolrCoreState.closeIndexWriter(DefaultSolrCoreState.java:70)
                  at
      org.apache.solr.update.DefaultSolrCoreState.close(DefaultSolrCoreState.java:371)
                  at
      org.apache.solr.update.SolrCoreState.decrefSolrCoreState(SolrCoreState.java:72)
                  at org.apache.solr.core.SolrCore.close(SolrCore.java:1073)
                  at org.apache.solr.core.SolrCores.close(SolrCores.java:117)
                  at
      org.apache.solr.core.CoreContainer.shutdown(CoreContainer.java:347)
                  at
      org.apache.solr.servlet.SolrDispatchFilter.destroy(SolrDispatchFilter.java:200)
                  at
      org.apache.catalina.core.ApplicationFilterConfig.release(ApplicationFilterConfig.java:315)
                  at
      org.apache.catalina.core.StandardContext.filterStop(StandardContext.java:4782)
                  at
      org.apache.catalina.core.StandardContext.stopInternal(StandardContext.java:5565)
                  at
      org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:232)
                  at
      org.apache.catalina.core.ContainerBase$StopChild.call(ContainerBase.java:1575)
                  at
      org.apache.catalina.core.ContainerBase$StopChild.call(ContainerBase.java:1564)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
                  at
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                  at
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:745)
          2014-09-29T14:22:11.384-0400: [GC 1299401K->384944K(2543104K),
      0.0427280 secs]
          64929 [localhost-startStop-2] INFO  org.apache.solr.core.SolrCore  –
      [mycore] Closing main searcher on request.
          64935 [localhost-startStop-2] INFO
       org.apache.solr.core.CachingDirectoryFactory  – Closing
      NRTCachingDirectoryFactory - 3 directories currently being tracked
          69937 [explicit-fetchindex-cmd] INFO
       org.apache.solr.update.DefaultSolrCoreState  – Creating new IndexWriter...
          69938 [explicit-fetchindex-cmd] ERROR
      org.apache.solr.handler.ReplicationHandler  – SnapPull failed
      :org.apache.lucene.store.AlreadyClosedException: Already closed
                  at
      org.apache.solr.core.CachingDirectoryFactory.get(CachingDirectoryFactory.java:341)
                  at
      org.apache.solr.handler.ReplicationHandler.loadReplicationProperties(ReplicationHandler.java:827)
                  at
      org.apache.solr.handler.SnapPuller.logReplicationTimeAndConfFiles(SnapPuller.java:569)
                  at
      org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:511)
                  at
      org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:337)
                  at
      org.apache.solr.handler.ReplicationHandler$1.run(ReplicationHandler.java:227)
      
          69942 [localhost-startStop-2] INFO
       org.apache.solr.core.CachingDirectoryFactory  – looking to close
      /usr/local/apache-tomcat-7.0.39/solr/mycore/data
      [CachedDir<<refCount=0;path=/usr/local/apache-tomcat-7.0.39/solr/mycore/data;done=false>>]
          69942 [localhost-startStop-2] INFO
       org.apache.solr.core.CachingDirectoryFactory  – Closing directory:
      /usr/local/apache-tomcat-7.0.39/solr/mycore/data
          69942 [localhost-startStop-2] INFO
       org.apache.solr.core.CachingDirectoryFactory  – looking to close
      /usr/local/apache-tomcat-7.0.39/solr/mycore/data/index
      [CachedDir<<refCount=0;path=/usr/local/apache-tomcat-7.0.39/solr/mycore/data/index;done=false>>]
          69942 [localhost-startStop-2] INFO
       org.apache.solr.core.CachingDirectoryFactory  – Closing directory:
      /usr/local/apache-tomcat-7.0.39/solr/mycore/data/index
          69942 [localhost-startStop-2] INFO
       org.apache.solr.core.CachingDirectoryFactory  – looking to close
      /usr/local/apache-tomcat-7.0.39/solr/mycore/data/index.20140929142203447
      [CachedDir<<refCount=0;path=/usr/local/apache-tomcat-7.0.39/solr/mycore/data/index.20140929142203447;done=true>>]
          69942 [localhost-startStop-2] INFO
       org.apache.solr.core.CachingDirectoryFactory  – Closing directory:
      /usr/local/apache-tomcat-7.0.39/solr/mycore/data/index.20140929142203447
          69943 [localhost-startStop-2] INFO
       org.apache.solr.core.CachingDirectoryFactory  – Removing directory before
      core close:
      /usr/local/apache-tomcat-7.0.39/solr/mycore/data/index.20140929142203447
          Sep 29, 2014 2:22:16 PM org.apache.catalina.ha.session.DeltaManager
      stopInternal
          INFO: Manager [localhost#/solr] expiring sessions upon shutdown
          Sep 29, 2014 2:22:16 PM org.apache.catalina.loader.WebappClassLoader
      clearReferencesThreads
          SEVERE: The web application [/solr] appears to have started a thread
      named [fsyncService-12-thread-1] but has failed to stop it. This is very
      likely to create a memory leak.
          Sep 29, 2014 2:22:16 PM org.apache.catalina.loader.WebappClassLoader
      checkThreadLocalMapForLeaks
          SEVERE: The web application [/solr] created a ThreadLocal with key of
      type [org.apache.solr.schema.DateField.ThreadLocalDateFormat] (value
      [org.apache.solr.schema.DateField$ThreadLocalDateFormat@66271f62]) and a
      value of type [org.apache.solr.schema.DateField.ISO8601CanonicalDateFormat]
      (value [org.apache.solr.schema.DateField$ISO8601CanonicalDateFormat@6b2ed43a])
      but failed to remove it when the web application was stopped. Threads are
      going to be renewed over time to try and avoid a probable memory leak.
          Sep 29, 2014 2:22:16 PM org.apache.coyote.AbstractProtocol stop
          INFO: Stopping ProtocolHandler ["http-bio-8080"]
          Sep 29, 2014 2:22:16 PM org.apache.coyote.AbstractProtocol stop
          INFO: Stopping ProtocolHandler ["ajp-bio-8009"]
          Sep 29, 2014 2:22:16 PM org.apache.coyote.AbstractProtocol destroy
          INFO: Destroying ProtocolHandler ["http-bio-8080"]
          Sep 29, 2014 2:22:16 PM org.apache.coyote.AbstractProtocol destroy
          INFO: Destroying ProtocolHandler ["ajp-bio-8009"]
      

      At this point it looks like tomcat has shutdown, but the process is still
      running, as checked by ps. If I start a new tomcat and repeat this process
      a few times, we end up with several zombie instances of tomcat running.

      Is this designed behavior? What is supposed to happen when the container is
      shutdown while replication is running? How are others handling this?

      Please let me know if I can provide any more information.

      Thanks in advance,
      Nick

      My followup

      I was helping to look into this with Nick & I think we may have figured out the core of the problem...

      The problem is easily reproducible by starting replication on the slave and then sending a shutdown command to tomcat (e.g. catalina.sh stop).

      With a debugger attached, it looks like the fsyncService thread is blocking VM shutdown because it is created as a non-daemon thread.

      Essentially what seems to be happening is that the fsyncService thread is running when 'catalina.sh stop' is executed. This goes in and calls SnapPuller.destroy() which aborts the current sync. Around line 517 of the SnapPuller, there is code that is supposed to cleanup the fsyncService thread, but I don't think it is getting executed because the thread that called SnapPuller.fetchLatestIndex() is configured as a daemon Thread, so the JVM ends up shutting that down before it can cleanup the fysncService...

      So... it seems like:

          if (fsyncService != null) 
      ExecutorUtil.shutdownNowAndAwaitTermination(fsyncService);
      

      could be added around line 1706 of SnapPuller.java, or

                 // since otherwise the JVM may kill the thread before any cleanup
                 // code can be called
                puller.setDaemon(false);
      

      could be added around line 230 of ReplicationHandler.java, however this needs some additional work (and I think it might need to be added regardless) since the cleanup code in SnapPuller(around 517) that shuts down the fsync thread never gets execute since logReplicationTimeAndConfFiles() can throw IO exceptions bypassing the rest of the finally block...So the call to

      logReplicationTimeAndConfFiles()

      around line 512 would need to get wrapped with a try/catch block to catch the IO exception...

      I will attach a patch as well that I believe fixes the issues. Is there a reason to not perform a hard shutdown of the fysnc thread? It seems sane enough to me to do a hard shutdown of the fsyncService if the action is aborted...

      Attachments

        1. cleanupSnapPullerFinally.patch
          1 kB
          Philip Black-Knight

        Activity

          People

            noble.paul Noble Paul
            gprbkg Philip Black-Knight
            Votes:
            1 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: