Solr / SOLR-4960

race condition in CoreContainer.shutdown leads to double closes on cores

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3
    • Fix Version/s: 4.4, Trunk
    • Component/s: None
    • Labels:
      None

      Description

      CoreContainer.shutdown has a race condition that can lead to a closed (or closing) core being handed out to an incoming request. This can further lead to SolrCore.close() logic being executed again when the request is finished.

      This bug was introduced by SOLR-4196 (r1451797).
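      To illustrate why handing out an already-closed core leads to a second run of the close logic, here is a minimal sketch. It assumes a reference-counted core whose cleanup runs whenever the count drops to zero; `RefCountedCore` and `DoubleCloseDemo` are hypothetical names for illustration, not Solr classes, and the interleaving is replayed on a single thread rather than raced.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a reference-counted core; not Solr's SolrCore.
class RefCountedCore {
    private final AtomicInteger refCount = new AtomicInteger(1); // the container's reference
    private final AtomicInteger cleanupRuns = new AtomicInteger(0);

    void open() {
        refCount.incrementAndGet();
    }

    void close() {
        // Cleanup runs every time the count reaches zero.
        if (refCount.decrementAndGet() == 0) {
            cleanupRuns.incrementAndGet(); // real resource release would happen here
        }
    }

    int cleanupRuns() {
        return cleanupRuns.get();
    }
}

class DoubleCloseDemo {
    static int run() {
        RefCountedCore core = new RefCountedCore();
        core.close(); // shutdown drops the container's reference: cleanup runs once
        core.open();  // a late request is still handed the core (count 0 -> 1)
        core.close(); // the request finishes (count 1 -> 0): cleanup runs again
        return core.cleanupRuns();
    }

    public static void main(String[] args) {
        System.out.println("cleanup ran " + run() + " times"); // prints "cleanup ran 2 times"
    }
}
```

      In this sketch nothing stops the count from being revived from zero, so any path that returns a core after shutdown has released it ends in a double close.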

      1. SOLR-4960.patch (2 kB, Yonik Seeley)
      2. SOLR-4960_getCore.patch (3 kB, Yonik Seeley)


          Activity

          Yonik Seeley created issue -
          Yonik Seeley made changes -
          Field Original Value New Value
          Link This issue is broken by SOLR-4196 [ SOLR-4196 ]
          Yonik Seeley made changes -
          Assignee Yonik Seeley [ yseeley@gmail.com ]
          Yonik Seeley added a comment -

          Here's a patch that fixes things for normal cores - I didn't touch the transient handling.

          Yonik Seeley made changes -
          Attachment SOLR-4960.patch [ 12589616 ]
          Yonik Seeley added a comment -

          There's also a race condition in getCore itself: there's no synchronization between the core lookup and incrementing the ref count, so we could hand out a closing/closed core. It looks like this was introduced by SOLR-4196 as well.
          I'll just use this issue to fix this bug also.
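
          The window between lookup and increment can be sketched as follows. This is an illustration under assumptions, not the Solr code: `SketchCore` and `RacyContainer` are hypothetical, and the `betweenSteps` hook stands in for another thread so that the bad interleaving is reproduced deterministically.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical core: marks itself closed when its reference count reaches zero.
class SketchCore {
    private final AtomicInteger refCount = new AtomicInteger(1);
    private volatile boolean closed = false;

    void open() { refCount.incrementAndGet(); }

    void close() {
        if (refCount.decrementAndGet() == 0) {
            closed = true;
        }
    }

    boolean isClosed() { return closed; }
}

// Hypothetical container with a check-then-act getCore.
class RacyContainer {
    private final Map<String, SketchCore> cores = new ConcurrentHashMap<>();

    void register(String name, SketchCore core) { cores.put(name, core); }

    // Racy: lookup and increment are two separate, unsynchronized steps.
    // betweenSteps is a test hook that plays the role of a concurrent thread.
    SketchCore getCore(String name, Runnable betweenSteps) {
        SketchCore core = cores.get(name); // step 1: lookup
        betweenSteps.run();                // another thread may run here
        if (core != null) {
            core.open();                   // step 2: increment, possibly too late
        }
        return core;
    }

    void shutdown() {
        for (SketchCore core : cores.values()) {
            core.close();
        }
        cores.clear();
    }
}

class RacyGetCoreDemo {
    static boolean handsOutClosedCore() {
        RacyContainer container = new RacyContainer();
        container.register("collection1", new SketchCore());
        // shutdown() slips in between the lookup and the increment
        SketchCore core = container.getCore("collection1", container::shutdown);
        return core != null && core.isClosed();
    }

    public static void main(String[] args) {
        System.out.println("closed core handed out: " + handsOutClosedCore()); // prints "closed core handed out: true"
    }
}
```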

          Yonik Seeley added a comment -

          Here's a patch to fix the race in CoreContainer.getCore()

          I changed the signature of SolrCores.getCoreFromAnyList to accept a boolean to increment the reference count and only fixed getCore to use it. There may still be other races due to the transient core feature.
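
          A rough sketch of that idea, assuming the lookup and the optional increment share one lock with shutdown. The names `CoreRegistry` and `CountedCore` and the `incRefCount` parameter are hypothetical and do not reproduce the patch or the real `SolrCores.getCoreFromAnyList` signature.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reference-counted core.
class CountedCore {
    private int refCount = 1; // the registry's own reference
    private boolean closed = false;

    synchronized void open() { refCount++; }

    synchronized void close() {
        if (--refCount == 0) {
            closed = true;
        }
    }

    synchronized boolean isClosed() { return closed; }
}

// Hypothetical registry: lookup and increment happen in one critical section.
class CoreRegistry {
    private final Object lock = new Object();
    private final Map<String, CountedCore> cores = new HashMap<>();

    void put(String name, CountedCore core) {
        synchronized (lock) {
            cores.put(name, core);
        }
    }

    // With incRefCount == true the caller owns a reference and must close() it.
    CountedCore getCore(String name, boolean incRefCount) {
        synchronized (lock) {
            CountedCore core = cores.get(name);
            if (core != null && incRefCount) {
                core.open();
            }
            return core;
        }
    }

    void shutdown() {
        List<CountedCore> toClose;
        synchronized (lock) {
            // After this block no new caller can find the cores.
            toClose = new ArrayList<>(cores.values());
            cores.clear();
        }
        for (CountedCore core : toClose) {
            core.close(); // drops only the registry's reference
        }
    }
}
```

          Under these assumptions a request either receives a core with a reference already held, so shutdown cannot drive the count to zero underneath it, or receives null once shutdown has cleared the map.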

          Yonik Seeley made changes -
          Attachment SOLR-4960_getCore.patch [ 12589651 ]
          Erick Erickson added a comment -

          Yonik Seeley, what's the state of all this? These patches look like they're on trunk but not 4x. When I looked at them I realized that the transient and pending lists could be handled the same way (actually in a single list), which simplifies things.

          I'll open up a new JIRA for my additions, but we need to merge these changes into 4x before I deal with the next patch.

          Erick Erickson made changes -
          Link This issue relates to SOLR-4974 [ SOLR-4974 ]
          Erick Erickson added a comment -

          OK, I'll run a few more tests, then merge these two patches back into the 4x code line and then apply SOLR-4974 to the whole lot.

          Erick Erickson added a comment -

          Several revisions, all told:

          trunk: 1496546, 1496620 and 1497999
          4x: 1498010

          Erick Erickson made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 5.0 [ 12321664 ]
          Resolution Fixed [ 1 ]
          Steve Rowe added a comment -

          Bulk close resolved 4.4 issues

          Steve Rowe made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Veera added a comment -

          Hi there

          I am running SolrCloud 4.3.1 and seeing a few of my cores die with this exception (during reads, writes, recovery, etc.):

          org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: SolrCoreState already closed
              at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:84)
              at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:524)

          Could this bug be the root cause of this issue?

          Transition            Time In Source Status   Execution Times   Last Executer    Last Execution Date
          Open -> Resolved      4d 4h 21m               1                 Erick Erickson   29/Jun/13 20:27
          Resolved -> Closed    23d 23h 10m             1                 Steve Rowe       23/Jul/13 19:38

            People

            • Assignee:
              Yonik Seeley
              Reporter:
              Yonik Seeley
            • Votes:
              0
              Watchers:
              7

              Dates

              • Created:
                Updated:
                Resolved:
