Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-9847

Deadlock on ManagedIndexSchema lock.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4, 7.0
    • Component/s: None
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
      None

      Description

      Seems we hold the lock while in ManagedIndexSchema.waitForSchemaZkVersionAgreement, so we may never see agreement because are updates may also be waiting on that lock.

        Activity

        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        Not a complete deadlock I guess, but takes a long time to sort out:

           [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:27:14, stalled for  190s at: TestBulkSchemaConcurrent.test
           [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:28:14, stalled for  250s at: TestBulkSchemaConcurrent.test
           [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:29:14, stalled for  310s at: TestBulkSchemaConcurrent.test
           [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:30:14, stalled for  370s at: TestBulkSchemaConcurrent.test
           [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:31:14, stalled for  430s at: TestBulkSchemaConcurrent.test
           [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:32:14, stalled for  490s at: TestBulkSchemaConcurrent.test
           [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:33:14, stalled for  550s at: TestBulkSchemaConcurrent.test
           [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:34:14, stalled for  610s at: TestBulkSchemaConcurrent.test
           [junit4] Suite: org.apache.solr.schema.TestBulkSchemaConcurrent
        
        Show
        markrmiller@gmail.com Mark Miller added a comment - Not a complete deadlock I guess, but takes a long time to sort out: [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:27:14, stalled for 190s at: TestBulkSchemaConcurrent.test [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:28:14, stalled for 250s at: TestBulkSchemaConcurrent.test [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:29:14, stalled for 310s at: TestBulkSchemaConcurrent.test [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:30:14, stalled for 370s at: TestBulkSchemaConcurrent.test [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:31:14, stalled for 430s at: TestBulkSchemaConcurrent.test [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:32:14, stalled for 490s at: TestBulkSchemaConcurrent.test [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:33:14, stalled for 550s at: TestBulkSchemaConcurrent.test [junit4] HEARTBEAT J9 PID(4929@totalmetal): 2016-12-11T21:34:14, stalled for 610s at: TestBulkSchemaConcurrent.test [junit4] Suite: org.apache.solr.schema.TestBulkSchemaConcurrent
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        I suppose that eventually happens because of:

        // eventually, this loop will get killed by the ExecutorService's timeout

        Show
        markrmiller@gmail.com Mark Miller added a comment - I suppose that eventually happens because of: // eventually, this loop will get killed by the ExecutorService's timeout
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        I don't know this code so well, but here is one idea.

        Show
        markrmiller@gmail.com Mark Miller added a comment - I don't know this code so well, but here is one idea.
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        The stacks:

        qtp1112580055-3894 [WAITING]
        sun.misc.Unsafe.park(boolean, long) Unsafe.java (native)
        java.util.concurrent.locks.LockSupport.parkNanos(Object, long) LockSupport.java:215
        java.util.concurrent.FutureTask.awaitDone(boolean, long) FutureTask.java:426
        java.util.concurrent.FutureTask.get(long, TimeUnit) FutureTask.java:204
        java.util.concurrent.AbstractExecutorService.invokeAll(Collection, long, TimeUnit) AbstractExecutorService.java:289
        org.apache.solr.schema.ManagedIndexSchema.waitForSchemaZkVersionAgreement(String, String, int, ZkController, int) ManagedIndexSchema.java:238
        org.apache.solr.schema.SchemaManager.waitForOtherReplicasToUpdate(TimeOut, int) SchemaManager.java:167
        org.apache.solr.schema.SchemaManager.doOperations(List) SchemaManager.java:137
        org.apache.solr.schema.SchemaManager.performOperations(Reader) SchemaManager.java:92
        org.apache.solr.handler.SchemaHandler.handleRequestBody(SolrQueryRequest, SolrQueryResponse) SchemaHandler.java:91
        org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse) RequestHandlerBase.java:152
        org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) SolrCore.java:2227
        
        zkCallback-535-thread-2-processing-n:127.0.0.1:34881_c_c [BLOCKED]
        org.apache.solr.schema.ZkIndexSchemaReader.updateSchema(Watcher, int) ZkIndexSchemaReader.java:131
        org.apache.solr.schema.ZkIndexSchemaReader.access$200(ZkIndexSchemaReader, Watcher, int) ZkIndexSchemaReader.java:39
        org.apache.solr.schema.ZkIndexSchemaReader$2.process(WatchedEvent) ZkIndexSchemaReader.java:97
        org.apache.solr.common.cloud.SolrZkClient$3.lambda$process$0(Watcher, WatchedEvent) SolrZkClient.java:268
        org.apache.solr.common.cloud.SolrZkClient$3$$Lambda$62.run()
        java.util.concurrent.Executors$RunnableAdapter.call() Executors.java:511
        java.util.concurrent.FutureTask.run() FutureTask.java:266
        org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ArrayList, List, Map, String, Runnable, Exception) ExecutorUtil.java:229
        org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$40.run()
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1142
        java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:617
        java.lang.Thread.run() Thread.java:745
        
        zkCallback-522-thread-2-processing-n:127.0.0.1:39199_c_c [BLOCKED]
        org.apache.solr.schema.ZkIndexSchemaReader.updateSchema(Watcher, int) ZkIndexSchemaReader.java:131
        org.apache.solr.schema.ZkIndexSchemaReader.access$200(ZkIndexSchemaReader, Watcher, int) ZkIndexSchemaReader.java:39
        org.apache.solr.schema.ZkIndexSchemaReader$2.process(WatchedEvent) ZkIndexSchemaReader.java:97
        org.apache.solr.common.cloud.SolrZkClient$3.lambda$process$0(Watcher, WatchedEvent) SolrZkClient.java:268
        org.apache.solr.common.cloud.SolrZkClient$3$$Lambda$62.run()
        java.util.concurrent.Executors$RunnableAdapter.call() Executors.java:511
        java.util.concurrent.FutureTask.run() FutureTask.java:266
        org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ArrayList, List, Map, String, Runnable, Exception) ExecutorUtil.java:229
        org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$40.run()
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1142
        java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:617
        java.lang.Thread.run() Thread.java:745
        
        Show
        markrmiller@gmail.com Mark Miller added a comment - The stacks: qtp1112580055-3894 [WAITING] sun.misc.Unsafe.park(boolean, long) Unsafe.java (native) java.util.concurrent.locks.LockSupport.parkNanos(Object, long) LockSupport.java:215 java.util.concurrent.FutureTask.awaitDone(boolean, long) FutureTask.java:426 java.util.concurrent.FutureTask.get(long, TimeUnit) FutureTask.java:204 java.util.concurrent.AbstractExecutorService.invokeAll(Collection, long, TimeUnit) AbstractExecutorService.java:289 org.apache.solr.schema.ManagedIndexSchema.waitForSchemaZkVersionAgreement(String, String, int, ZkController, int) ManagedIndexSchema.java:238 org.apache.solr.schema.SchemaManager.waitForOtherReplicasToUpdate(TimeOut, int) SchemaManager.java:167 org.apache.solr.schema.SchemaManager.doOperations(List) SchemaManager.java:137 org.apache.solr.schema.SchemaManager.performOperations(Reader) SchemaManager.java:92 org.apache.solr.handler.SchemaHandler.handleRequestBody(SolrQueryRequest, SolrQueryResponse) SchemaHandler.java:91 org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse) RequestHandlerBase.java:152 org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse) SolrCore.java:2227 zkCallback-535-thread-2-processing-n:127.0.0.1:34881_c_c [BLOCKED] org.apache.solr.schema.ZkIndexSchemaReader.updateSchema(Watcher, int) ZkIndexSchemaReader.java:131 org.apache.solr.schema.ZkIndexSchemaReader.access$200(ZkIndexSchemaReader, Watcher, int) ZkIndexSchemaReader.java:39 org.apache.solr.schema.ZkIndexSchemaReader$2.process(WatchedEvent) ZkIndexSchemaReader.java:97 org.apache.solr.common.cloud.SolrZkClient$3.lambda$process$0(Watcher, WatchedEvent) SolrZkClient.java:268 org.apache.solr.common.cloud.SolrZkClient$3$$Lambda$62.run() java.util.concurrent.Executors$RunnableAdapter.call() Executors.java:511 java.util.concurrent.FutureTask.run() FutureTask.java:266 org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ArrayList, List, Map, String, Runnable, Exception) ExecutorUtil.java:229 org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$40.run() java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1142 java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:617 java.lang.Thread.run() Thread.java:745 zkCallback-522-thread-2-processing-n:127.0.0.1:39199_c_c [BLOCKED] org.apache.solr.schema.ZkIndexSchemaReader.updateSchema(Watcher, int) ZkIndexSchemaReader.java:131 org.apache.solr.schema.ZkIndexSchemaReader.access$200(ZkIndexSchemaReader, Watcher, int) ZkIndexSchemaReader.java:39 org.apache.solr.schema.ZkIndexSchemaReader$2.process(WatchedEvent) ZkIndexSchemaReader.java:97 org.apache.solr.common.cloud.SolrZkClient$3.lambda$process$0(Watcher, WatchedEvent) SolrZkClient.java:268 org.apache.solr.common.cloud.SolrZkClient$3$$Lambda$62.run() java.util.concurrent.Executors$RunnableAdapter.call() Executors.java:511 java.util.concurrent.FutureTask.run() FutureTask.java:266 org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ArrayList, List, Map, String, Runnable, Exception) ExecutorUtil.java:229 org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$40.run() java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1142 java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:617 java.lang.Thread.run() Thread.java:745
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        Steve Rowe, any thoughts on this one? It has bit me quite a lot locally over the months.

        Show
        markrmiller@gmail.com Mark Miller added a comment - Steve Rowe , any thoughts on this one? It has bit me quite a lot locally over the months.
        Hide
        steve_rowe Steve Rowe added a comment -

        I see this failure rarely on my Jenkins - like once every couple months.

        Mark Miller, have you noticed whether the SOLR-9832 changes impacted frequency of this failure?

        I'll start beasting to see whether I can get this failure locally.

        Show
        steve_rowe Steve Rowe added a comment - I see this failure rarely on my Jenkins - like once every couple months. Mark Miller , have you noticed whether the SOLR-9832 changes impacted frequency of this failure? I'll start beasting to see whether I can get this failure locally.
        Hide
        steve_rowe Steve Rowe added a comment -

        I beasted TestBulkSchemaConcurrent for 500 iterations with and without the patch, and the patch doesn't seem to change anything: unpatched, 27/500 iterations had the 600 second timeout; and patched, 28/500 iterations did. Mark Miller, did your patch improve things for you locally?

        A different idea: we could release the update lock after local changes are made, but before waiting for all replicas to receive the update. I'll try this out.

        Show
        steve_rowe Steve Rowe added a comment - I beasted TestBulkSchemaConcurrent for 500 iterations with and without the patch, and the patch doesn't seem to change anything: unpatched, 27/500 iterations had the 600 second timeout; and patched, 28/500 iterations did. Mark Miller , did your patch improve things for you locally? A different idea: we could release the update lock after local changes are made, but before waiting for all replicas to receive the update. I'll try this out.
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        Sorry, I have not yet had a lot of time on a branch with the test fix. Been meaning to get back to it. No evidence it works yet, just a thought looking at the deadlock stack traces.

        Show
        markrmiller@gmail.com Mark Miller added a comment - Sorry, I have not yet had a lot of time on a branch with the test fix. Been meaning to get back to it. No evidence it works yet, just a thought looking at the deadlock stack traces.
        Hide
        steve_rowe Steve Rowe added a comment -

        Patch that narrows the period during which the schema update lock is held to exclude waiting for other replicas to receive the update. With this patch, there were no stalls when I beasted TestBulkSchemaConcurrent for 500 iterations (versus 5% of the time unpatched and with Mark Miller's patch on this issue). Running all Solr tests now.

        Show
        steve_rowe Steve Rowe added a comment - Patch that narrows the period during which the schema update lock is held to exclude waiting for other replicas to receive the update. With this patch, there were no stalls when I beasted TestBulkSchemaConcurrent for 500 iterations (versus 5% of the time unpatched and with Mark Miller 's patch on this issue). Running all Solr tests now.
        Hide
        steve_rowe Steve Rowe added a comment -

        Running all Solr tests now.

        All Solr tests passed for me with my alternate patch, as does precommit. Mark Miller: does it look good to commit?

        Show
        steve_rowe Steve Rowe added a comment - Running all Solr tests now. All Solr tests passed for me with my alternate patch, as does precommit. Mark Miller : does it look good to commit?
        Hide
        markrmiller@gmail.com Mark Miller added a comment -

        Patch that narrows the period during which the schema update lock is held

        That was my first inclination, but I don't know how all this works that well yet, so I tried to hack something sim with the wait.

        does it look good to commit?

        +1. I'll report if I run into anything further.

        Show
        markrmiller@gmail.com Mark Miller added a comment - Patch that narrows the period during which the schema update lock is held That was my first inclination, but I don't know how all this works that well yet, so I tried to hack something sim with the wait. does it look good to commit? +1. I'll report if I run into anything further.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit cfbfc13209c57e1c76bd478c3b584135b5d109eb in lucene-solr's branch refs/heads/branch_6x from Steve Rowe
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cfbfc13 ]

        SOLR-9847: Stop blocking further schema updates while waiting for a pending update to propagate to other replicas. This reduces the likelihood of a (time-limited) distributed deadlock during concurrent schema updates.

        Show
        jira-bot ASF subversion and git services added a comment - Commit cfbfc13209c57e1c76bd478c3b584135b5d109eb in lucene-solr's branch refs/heads/branch_6x from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=cfbfc13 ] SOLR-9847 : Stop blocking further schema updates while waiting for a pending update to propagate to other replicas. This reduces the likelihood of a (time-limited) distributed deadlock during concurrent schema updates.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit 04108d993574537d9623d5d1bf2658cafad12ef8 in lucene-solr's branch refs/heads/master from Steve Rowe
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=04108d9 ]

        SOLR-9847: Stop blocking further schema updates while waiting for a pending update to propagate to other replicas. This reduces the likelihood of a (time-limited) distributed deadlock during concurrent schema updates.

        Show
        jira-bot ASF subversion and git services added a comment - Commit 04108d993574537d9623d5d1bf2658cafad12ef8 in lucene-solr's branch refs/heads/master from Steve Rowe [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=04108d9 ] SOLR-9847 : Stop blocking further schema updates while waiting for a pending update to propagate to other replicas. This reduces the likelihood of a (time-limited) distributed deadlock during concurrent schema updates.
        Hide
        steve_rowe Steve Rowe added a comment -

        Thanks Mark.

        Show
        steve_rowe Steve Rowe added a comment - Thanks Mark.

          People

          • Assignee:
            steve_rowe Steve Rowe
            Reporter:
            markrmiller@gmail.com Mark Miller
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development