Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-4161

deadlock if commit+newSearcher occurs during core close, can happen as a result of snappuller (occured in TestReplicationHandler)

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Reopened
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      There appears to be a lock related bug in the DefaultSolrCoreState/DirectUpdateHandler2 interactions. It appears that if CoreContainer is shutting down the core at the same time that some other thread attempts to do a commit which triggers a newSearcher, then DefaultSolrCoreState.closeIndexWriter and DefaultSolrCoreState.getIndexWriter get into deadlock.

      This has been observed in TestReplicationHandler, but doesn't appear to be related to any bugs in thta testcase, so it seems like it could easily affect real life users as well.

      Summary of the deadlock stacks, see attachments for full details...

      Found one Java-level deadlock:
      =============================
      "snapPuller-422-thread-1":
        waiting to lock monitor 0x00007f5a2011a9e0 (object 0x00000000f5f485a0, a org.apache.solr.update.DefaultSolrCoreState),
        which is held by "TEST-TestReplicationHandler.test-seed#[1B46F52130C14E03]"
      "TEST-TestReplicationHandler.test-seed#[1B46F52130C14E03]":
        waiting for ownable synchronizer 0x00000000f60fe5c8, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
        which is held by "snapPuller-422-thread-1"
      
      Java stack information for the threads listed above:
      ===================================================
      "snapPuller-422-thread-1":
      	at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:77)
      	- waiting to lock <0x00000000f5f485a0> (a org.apache.solr.update.DefaultSolrCoreState)
      	at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1358)
      	at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:561)
      	- locked <0x00000000f5f485d0> (a java.lang.Object)
      	at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:655)
      	at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:454)
      	at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:273)
      	at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:222)
      ...
      
      "TEST-TestReplicationHandler.test-seed#[1B46F52130C14E03]":
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x00000000f60fe5c8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:838)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:871)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1201)
      	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
      	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
      	at org.apache.solr.update.DirectUpdateHandler2.closeWriter(DirectUpdateHandler2.java:668)
      	at org.apache.solr.update.DefaultSolrCoreState.closeIndexWriter(DefaultSolrCoreState.java:64)
      	- locked <0x00000000f5f485a0> (a org.apache.solr.update.DefaultSolrCoreState)
      	at org.apache.solr.update.DefaultSolrCoreState.close(DefaultSolrCoreState.java:259)
      	- locked <0x00000000f5f485a0> (a org.apache.solr.update.DefaultSolrCoreState)
      	at org.apache.solr.core.SolrCore.decrefSolrCoreState(SolrCore.java:879)
      	- locked <0x00000000f5f485a0> (a org.apache.solr.update.DefaultSolrCoreState)
      	at org.apache.solr.core.SolrCore.close(SolrCore.java:971)
      	at org.apache.solr.core.CoreContainer.shutdown(CoreContainer.java:723)
      
      

      Original Report...

      while testing out another patch i noticed "stalled" heartbeat messages getting logged by TestReplicationHandler.test and started taking some stack traces to see if it was in the code i was working on.

      it's not, so i suspect it's unrelated to the changes i'm looking at, but i did notice that there was a full on deadlock reported, so i wanted to make sure it got tracked.

      Attachments

        1. dump2.txt
          44 kB
          Chris M. Hostetter
        2. dump3.txt
          44 kB
          Chris M. Hostetter
        3. dump4.txt
          44 kB
          Chris M. Hostetter
        4. dump5.txt
          44 kB
          Chris M. Hostetter

        Issue Links

          Activity

            People

              Unassigned Unassigned
              hossman Chris M. Hostetter
              Votes:
              3 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated: