Solr
  1. Solr
  2. SOLR-5615

Deadlock while trying to recover after a ZK session expiry

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.4, 4.5, 4.6
    • Fix Version/s: 4.6.1, Trunk
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      The sequence of events which might trigger this is as follows:

      • Leader of a shard, say OL, has a ZK expiry
      • The new leader, NL, starts the election process
      • NL, through Overseer, clears the current leader (OL) for the shard from the cluster state
      • OL reconnects to ZK, calls onReconnect from event thread (main-EventThread)
      • OL marks itself down
      • OL sets up watches for cluster state, and then retrieves it (with no leader for this shard)
      • NL, through Overseer, updates cluster state to mark itself leader for the shard
      • OL tries to register itself as a replica, and waits till the cluster state is updated
        with the new leader from event thread
      • ZK sends a watch update to OL, but it is blocked on the event thread waiting for it.

      Oops. This finally breaks out after trying to register itself as replica times out after 20 mins.

      1. SOLR-5615.patch
        4 kB
        Mark Miller
      2. SOLR-5615.patch
        4 kB
        Mark Miller
      3. SOLR-5615.patch
        10 kB
        Mark Miller

        Activity

        Hide
        Ramkumar Aiyengar added a comment -

        Submitted https://github.com/apache/lucene-solr/pull/13 for one possible solution, though I am not sure if this is the right way to go about this..

        Show
        Ramkumar Aiyengar added a comment - Submitted https://github.com/apache/lucene-solr/pull/13 for one possible solution, though I am not sure if this is the right way to go about this..
        Hide
        Mark Miller added a comment -

        Not sure given the info, but the patch doesn't seem crazy to me. I've made a few adjustments in this patch.

        Show
        Mark Miller added a comment - Not sure given the info, but the patch doesn't seem crazy to me. I've made a few adjustments in this patch.
        Hide
        Ramkumar Aiyengar added a comment - - edited

        Here's some log trace which actually happened, might help understand the scenario above..

        2014-01-06 06:22:03,867 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:88] Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
        
        // ..
        
        2014-01-06 06:22:12,529 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:103] Connection with ZooKeeper reestablished.
        
        // ..
        
        2014-01-06 06:22:36,573 INFO [main-EventThread] o.a.s.c.ZkController [ZkController.java:989] publishing core=collection_20131120_shard205_replica2 state=down
        
        // ..
        
        2014-01-06 06:28:01,479 INFO [main-EventThread] o.a.s.c.c.ZkStateReader [ZkStateReader.java:199] Updating cluster state from ZooKeeper... 
        2014-01-06 06:28:01,487 INFO [main-EventThread] o.a.s.c.ZkController [ZkController.java:651] Register node as live in ZooKeeper:/live_nodes/host5:10750_solr
        
        // See trace above, it directly got leader props from ZK successfully, so there is actually a leader at this point contrary to what it finds below
        
        2014-01-06 06:28:01,567 INFO [main-EventThread] o.a.s.c.c.SolrZkClient [SolrZkClient.java:378] makePath: /live_nodes/host5:10750_solr
        2014-01-06 06:28:01,669 INFO [main-EventThread] o.a.s.c.ZkController [ZkController.java:757] Register replica - core:collection_20131120_shard241_replica2 address:http://host5:10750/solr collection:collection_20131120 shard:shard241
        2014-01-06 06:28:01,669 INFO [main-EventThread] o.a.s.c.s.i.HttpClientUtil [HttpClientUtil.java:103] Creating new http client, config:maxConnections=10000&maxConnectionsPerHost=20&connTimeout=30000&socketTimeout=30000&retry=false
        
        // nothing much after this on main-EventThread for 20 mins..
        
        2014-01-06 06:54:01,786 ERROR [main-EventThread] o.a.s.c.ZkController [ZkController.java:869] Error getting leader from zk
        org.apache.solr.common.SolrException: No registered leader was found, collection:collection_20131120 slice:shard241
        
        // Then goes on to the next replica ..
        
        2014-01-06 06:54:01,786 INFO [main-EventThread] o.a.s.c.ZkController [ZkController.java:757] Register replica - core:collection_20131120_shard209_replica2 address:http://host5:10750/solr collection:collection_20131120 shard:shard209
        2014-01-06 06:54:01,786 INFO [main-EventThread] o.a.s.c.s.i.HttpClientUtil [HttpClientUtil.java:103] Creating new http client, config:maxConnections=10000&maxConnectionsPerHost=20&connTimeout=30000&socketTimeout=30000&retry=false
        
        // waits another twenty mins (by which time I ordered a shutdown, so things started erroring out sooner after that)
        
        2014-01-06 07:19:21,656 ERROR [main-EventThread] o.a.s.c.ZkController [ZkController.java:869] Error getting leader from zk
        org.apache.solr.common.SolrException: No registered leader was found, collection:collection_20131120 slice:shard209
        
        // After trying to register all other replicas, these failed fast because we had ordered a shutdown already..
        
        2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.s.c.c.DefaultConnectionStrategy [DefaultConnectionStrategy.java:48] Reconnected to ZooKeeper
        2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:130] Connected:true
        
        // And immediately, *now* it fires all the events it was waiting for!
        
        2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:72] Watcher org.apache.solr.common.cloud.ConnectionManager@2467da0a name:ZooKeeperConnection Watcher:host1:11600,host2:11600,host3:11600 got event WatchedEvent state:Disconnected type:None path:null path:null type:None
        2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.z.ClientCnxn [ClientCnxn.java:509] EventThread shut down
        
        // many more such disc events, and then the watches
        
        2014-01-06 07:19:21,694 WARN [main-EventThread] o.a.s.c.c.ZkStateReader [ZkStateReader.java:281] ZooKeeper watch triggered, but Solr cannot talk to ZK
        2014-01-06 07:19:21,694 INFO [main-EventThread] o.a.s.c.c.ZkStateReader [ZkStateReader.java:210] A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 112)
        2014-01-06 07:19:21,694 WARN [main-EventThread] o.a.s.c.c.ZkStateReader [ZkStateReader.java:234] ZooKeeper watch triggered, but Solr cannot talk to ZK
        
        
        Show
        Ramkumar Aiyengar added a comment - - edited Here's some log trace which actually happened, might help understand the scenario above.. 2014-01-06 06:22:03,867 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:88] Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper... // .. 2014-01-06 06:22:12,529 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:103] Connection with ZooKeeper reestablished. // .. 2014-01-06 06:22:36,573 INFO [main-EventThread] o.a.s.c.ZkController [ZkController.java:989] publishing core=collection_20131120_shard205_replica2 state=down // .. 2014-01-06 06:28:01,479 INFO [main-EventThread] o.a.s.c.c.ZkStateReader [ZkStateReader.java:199] Updating cluster state from ZooKeeper... 2014-01-06 06:28:01,487 INFO [main-EventThread] o.a.s.c.ZkController [ZkController.java:651] Register node as live in ZooKeeper:/live_nodes/host5:10750_solr // See trace above, it directly got leader props from ZK successfully, so there is actually a leader at this point contrary to what it finds below 2014-01-06 06:28:01,567 INFO [main-EventThread] o.a.s.c.c.SolrZkClient [SolrZkClient.java:378] makePath: /live_nodes/host5:10750_solr 2014-01-06 06:28:01,669 INFO [main-EventThread] o.a.s.c.ZkController [ZkController.java:757] Register replica - core:collection_20131120_shard241_replica2 address:http: //host5:10750/solr collection:collection_20131120 shard:shard241 2014-01-06 06:28:01,669 INFO [main-EventThread] o.a.s.c.s.i.HttpClientUtil [HttpClientUtil.java:103] Creating new http client, config:maxConnections=10000&maxConnectionsPerHost=20&connTimeout=30000&socketTimeout=30000&retry= false // nothing much after this on main-EventThread for 20 mins.. 2014-01-06 06:54:01,786 ERROR [main-EventThread] o.a.s.c.ZkController [ZkController.java:869] Error getting leader from zk org.apache.solr.common.SolrException: No registered leader was found, collection:collection_20131120 slice:shard241 // Then goes on to the next replica .. 2014-01-06 06:54:01,786 INFO [main-EventThread] o.a.s.c.ZkController [ZkController.java:757] Register replica - core:collection_20131120_shard209_replica2 address:http: //host5:10750/solr collection:collection_20131120 shard:shard209 2014-01-06 06:54:01,786 INFO [main-EventThread] o.a.s.c.s.i.HttpClientUtil [HttpClientUtil.java:103] Creating new http client, config:maxConnections=10000&maxConnectionsPerHost=20&connTimeout=30000&socketTimeout=30000&retry= false // waits another twenty mins (by which time I ordered a shutdown, so things started erroring out sooner after that) 2014-01-06 07:19:21,656 ERROR [main-EventThread] o.a.s.c.ZkController [ZkController.java:869] Error getting leader from zk org.apache.solr.common.SolrException: No registered leader was found, collection:collection_20131120 slice:shard209 // After trying to register all other replicas, these failed fast because we had ordered a shutdown already.. 2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.s.c.c.DefaultConnectionStrategy [DefaultConnectionStrategy.java:48] Reconnected to ZooKeeper 2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:130] Connected: true // And immediately, *now* it fires all the events it was waiting for ! 2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.s.c.c.ConnectionManager [ConnectionManager.java:72] Watcher org.apache.solr.common.cloud.ConnectionManager@2467da0a name:ZooKeeperConnection Watcher:host1:11600,host2:11600,host3:11600 got event WatchedEvent state:Disconnected type:None path: null path: null type:None 2014-01-06 07:19:21,693 INFO [main-EventThread] o.a.z.ClientCnxn [ClientCnxn.java:509] EventThread shut down // many more such disc events, and then the watches 2014-01-06 07:19:21,694 WARN [main-EventThread] o.a.s.c.c.ZkStateReader [ZkStateReader.java:281] ZooKeeper watch triggered, but Solr cannot talk to ZK 2014-01-06 07:19:21,694 INFO [main-EventThread] o.a.s.c.c.ZkStateReader [ZkStateReader.java:210] A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 112) 2014-01-06 07:19:21,694 WARN [main-EventThread] o.a.s.c.c.ZkStateReader [ZkStateReader.java:234] ZooKeeper watch triggered, but Solr cannot talk to ZK
        Hide
        Mark Miller added a comment -

        Thanks, perfect.

        Show
        Mark Miller added a comment - Thanks, perfect.
        Hide
        Mark Miller added a comment -

        Okay, now it's more clear to me. We need to run onReconnect in a background thread I think.

        Show
        Mark Miller added a comment - Okay, now it's more clear to me. We need to run onReconnect in a background thread I think.
        Hide
        Ramkumar Aiyengar added a comment -

        That, incidentally, was my first attempt at a fix! (Should have a diff..) However, onReconnect in any case runs in the event thread of the expired ZK which wouldn't have events after that, so it's effectively backgrounded? It should still work as a solution I guess..

        Show
        Ramkumar Aiyengar added a comment - That, incidentally, was my first attempt at a fix! (Should have a diff..) However, onReconnect in any case runs in the event thread of the expired ZK which wouldn't have events after that, so it's effectively backgrounded? It should still work as a solution I guess..
        Hide
        Mark Miller added a comment -

        However, onReconnect in any case runs in the event thread of the expired ZK which wouldn't have events after that, so it's effectively backgrounded?

        But it holds the ConnectionManager this lock while it runs right? I think we just don't want to hold that lock while it runs.

        I think the other changes are likely okay too - I'm playing around with a combination of the two.

        Show
        Mark Miller added a comment - However, onReconnect in any case runs in the event thread of the expired ZK which wouldn't have events after that, so it's effectively backgrounded? But it holds the ConnectionManager this lock while it runs right? I think we just don't want to hold that lock while it runs. I think the other changes are likely okay too - I'm playing around with a combination of the two.
        Hide
        Mark Miller added a comment -

        Another rev.

        Show
        Mark Miller added a comment - Another rev.
        Hide
        Mark Miller added a comment -

        Even with the other changes, I like the idea of using a background thread because I don't think it's right that we do that whole reconnect process before we set that we are connected to zk and get out of the connection manager. I really don't think that process should hold up the connection manager at all - it's meant to just trigger it.

        Show
        Mark Miller added a comment - Even with the other changes, I like the idea of using a background thread because I don't think it's right that we do that whole reconnect process before we set that we are connected to zk and get out of the connection manager. I really don't think that process should hold up the connection manager at all - it's meant to just trigger it.
        Hide
        Ramkumar Aiyengar added a comment -

        Fair enough. Would that allow multiple onReconnect.command () invocations to run simultaneously, and is that fine? (on mobile, so my reading of the patch could be wrong) What if we were in the process of recovering when we were unfortunate enough to get a second expiry thereby bringing all nodes down?

        Show
        Ramkumar Aiyengar added a comment - Fair enough. Would that allow multiple onReconnect.command () invocations to run simultaneously, and is that fine? (on mobile, so my reading of the patch could be wrong) What if we were in the process of recovering when we were unfortunate enough to get a second expiry thereby bringing all nodes down?
        Hide
        Mark Miller added a comment -

        Yeah, I've been considered the same thing. My inclination was it was okay, but we may have to add something to cancel our leader election before joining the election to be sure.

        Show
        Mark Miller added a comment - Yeah, I've been considered the same thing. My inclination was it was okay, but we may have to add something to cancel our leader election before joining the election to be sure.
        Hide
        Mark Miller added a comment -

        Another rev that adds what I think is a decent change anyway - before joining an election, cancel any known previous election participation.

        Show
        Mark Miller added a comment - Another rev that adds what I think is a decent change anyway - before joining an election, cancel any known previous election participation.
        Hide
        ASF subversion and git services added a comment -

        Commit 1556572 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1556572 ]

        SOLR-5615: Deadlock while trying to recover after a ZK session expiration.

        Show
        ASF subversion and git services added a comment - Commit 1556572 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1556572 ] SOLR-5615 : Deadlock while trying to recover after a ZK session expiration.
        Hide
        ASF subversion and git services added a comment -

        Commit 1556573 from Mark Miller in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1556573 ]

        SOLR-5615: Deadlock while trying to recover after a ZK session expiration.

        Show
        ASF subversion and git services added a comment - Commit 1556573 from Mark Miller in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1556573 ] SOLR-5615 : Deadlock while trying to recover after a ZK session expiration.
        Hide
        Ramkumar Aiyengar added a comment -

        Had noticed a separate race while initially investigating running onReconnect in a separate thread, https://github.com/apache/lucene-solr/pull/15 for that.

        Show
        Ramkumar Aiyengar added a comment - Had noticed a separate race while initially investigating running onReconnect in a separate thread, https://github.com/apache/lucene-solr/pull/15 for that.
        Hide
        Mark Miller added a comment -

        Thanks, we should open a new JIRA issue for that.

        Show
        Mark Miller added a comment - Thanks, we should open a new JIRA issue for that.
        Hide
        Ramkumar Aiyengar added a comment -
        Show
        Ramkumar Aiyengar added a comment - SOLR-5620
        Hide
        Shawn Heisey added a comment -

        Noted while backporting SOLR-5543 to the 4.6 branch: In the trunk CHANGES.txt file for trunk, this issue number shows up in the 4.6.1 section, but does not appear to have been actually backported to the 4.6 branch yet.

        Show
        Shawn Heisey added a comment - Noted while backporting SOLR-5543 to the 4.6 branch: In the trunk CHANGES.txt file for trunk, this issue number shows up in the 4.6.1 section, but does not appear to have been actually backported to the 4.6 branch yet.
        Hide
        Mark Miller added a comment -

        Yeah, I started it, but turns out it's difficult without backporting another fix first.

        Show
        Mark Miller added a comment - Yeah, I started it, but turns out it's difficult without backporting another fix first.
        Hide
        ASF subversion and git services added a comment -

        Commit 1557811 from Mark Miller in branch 'dev/branches/lucene_solr_4_6'
        [ https://svn.apache.org/r1557811 ]

        SOLR-5615: Deadlock while trying to recover after a ZK session expiration.

        Show
        ASF subversion and git services added a comment - Commit 1557811 from Mark Miller in branch 'dev/branches/lucene_solr_4_6' [ https://svn.apache.org/r1557811 ] SOLR-5615 : Deadlock while trying to recover after a ZK session expiration.
        Hide
        ASF subversion and git services added a comment -

        Commit 1557812 from Mark Miller in branch 'dev/branches/lucene_solr_4_6'
        [ https://svn.apache.org/r1557812 ]

        SOLR-5615: Add CHANGES entry to 4.6 branch

        Show
        ASF subversion and git services added a comment - Commit 1557812 from Mark Miller in branch 'dev/branches/lucene_solr_4_6' [ https://svn.apache.org/r1557812 ] SOLR-5615 : Add CHANGES entry to 4.6 branch
        Hide
        ASF subversion and git services added a comment -

        Commit 1557814 from Mark Miller in branch 'dev/branches/lucene_solr_4_6'
        [ https://svn.apache.org/r1557814 ]

        SOLR-5615: Fix CHANGES mismerge

        Show
        ASF subversion and git services added a comment - Commit 1557814 from Mark Miller in branch 'dev/branches/lucene_solr_4_6' [ https://svn.apache.org/r1557814 ] SOLR-5615 : Fix CHANGES mismerge
        Hide
        Mark Miller added a comment -

        Bulk change via releasing 4.6.1 in JIRA. Any issues that were not resolved against 4.6.1 should be. I have not gone through the generated emails yet.

        Show
        Mark Miller added a comment - Bulk change via releasing 4.6.1 in JIRA. Any issues that were not resolved against 4.6.1 should be. I have not gone through the generated emails yet.
        Hide
        ASF GitHub Bot added a comment -

        Github user andyetitmoves closed the pull request at:

        https://github.com/apache/lucene-solr/pull/13

        Show
        ASF GitHub Bot added a comment - Github user andyetitmoves closed the pull request at: https://github.com/apache/lucene-solr/pull/13

          People

          • Assignee:
            Mark Miller
            Reporter:
            Ramkumar Aiyengar
          • Votes:
            2 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development