SOLR-7844: Zookeeper session expiry during shard leader election can cause multiple leaders.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.10.4
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels: None

      Description

      If the ZooKeeper session expires for a host during shard leader election, the ephemeral leader_elect nodes are removed. However, the threads that were processing the election are still present (and could believe the host won the election). They will then incorrectly create leader nodes once a new ZooKeeper session is established.

      This introduces a subtle race condition that could cause two hosts to become leader.

      Scenario:

      A three-machine cluster in which all of the machines restart at approximately the same time.

      The first machine starts and writes a leader_elect ephemeral node; it's the only candidate in the election, so it wins and starts the leadership process. Because it knows it has peers, it begins to block, waiting for the peers to arrive.

      During this period of blocking[1] the ZK connection drops and the session expires.

      A new ZK session is established, and ElectionContext.cancelElection is called. Then register() is called and a new set of leader_elect ephemeral nodes are created.

      During the period between the ZK session expiring and the new set of leader_elect nodes being created, the second machine starts.

      It creates its leader_elect ephemeral nodes; as there are no other nodes, it wins the election and starts the leadership process. As it's still missing one of its peers, it begins to block, waiting for the third machine to join.

      There is now a race between machine1 & machine2, both of whom think they are the leader.

      So far, this isn't too bad, because the machine that loses the race will fail when it tries to create the /collection/name/leader/shard1 node (as it already exists), and will rejoin the election.

      While this is happening, machine3 has started and has queued for leadership behind machine2.

      If the loser of the race is machine2, when it rejoins the election it cancels the current context, deleting its leader_elect ephemeral nodes.

      At this point, machine3 believes it has become leader (the watcher it has on the leader_elect node fires), and it runs the LeaderElector::checkIfIAmLeader method. This method DELETES the current /collection/name/leader/shard1 node, then starts the leadership process (as all three machines are now running, it does not block to wait).

      So, machine1 won the race with machine2 and declared its leadership and created the nodes. However, machine3 has just deleted them, and recreated them for itself. So machine1 and machine3 both believe they are the leader.

      I am thinking that the fix should be to cancel & close all election contexts immediately on reconnect (we do cancel them, but it's run serially, which has blocking issues, and just canceling does not cause the wait loop to exit). The election context logic already has checks on the closed flag, so the threads should exit if they see it has been closed.

      I'm working on a patch for this.
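The proposed fix hinges on the wait loop actually observing the closed flag. A minimal sketch of that shape (hypothetical names, not Solr's actual LeaderElector code; the flags are supplied as functions for illustration):

```java
import java.util.function.BooleanSupplier;

public class ElectionWaitLoop {

    // Result of the wait: either all peers arrived, or the context was closed.
    public enum Outcome { PEERS_ARRIVED, CONTEXT_CLOSED }

    // Re-check the closed flag on every iteration so that a cancel-and-close
    // on reconnect makes the loop exit, instead of the thread going on to
    // declare leadership against a dead ZK session.
    public static Outcome waitForPeers(BooleanSupplier peersArrived,
                                       BooleanSupplier closed) {
        while (!peersArrived.getAsBoolean()) {
            if (closed.getAsBoolean()) {
                return Outcome.CONTEXT_CLOSED; // session expired: abandon election
            }
            // the real code would sleep between polls here
        }
        return Outcome.PEERS_ARRIVED;
    }
}
```

The key point is that cancellation alone is not enough; the blocked thread must check the flag inside its loop.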

      1. SOLR-7844.patch
        35 kB
        Mark Miller
      2. SOLR-7844.patch
        35 kB
        Mark Miller
      3. SOLR-7844.patch
        35 kB
        Mark Miller
      4. SOLR-7844.patch
        21 kB
        Mark Miller
      5. SOLR-7844.patch
        17 kB
        Mark Miller
      6. SOLR-7844.patch
        12 kB
        Mark Miller
      7. SOLR-7844.patch
        10 kB
        Mark Miller
      8. SOLR-7844.patch
        8 kB
        Mark Miller
      9. SOLR-7844.patch
        4 kB
        Mike Roberts
      10. SOLR-7844-5x.patch
        37 kB
        Mark Miller

        Activity

        Mike Roberts added a comment -

        Initial attempt at a patch.

        When we reconnect after a ZooKeeper session expiry, we cancel any ongoing leader elections. The current election threads check status in a few places to ensure that no 'leader-specific' operations are taken after the election has been canceled.

        I don't think this is bulletproof, but it does significantly reduce the size of the window in which this scenario can occur.

        Mark Miller added a comment -

        Did you see this in the wild? Can you reproduce it?

        Any chance you can port this patch to trunk?

        This looks like a good improvement.

        This method DELETES the current /collection/name/leader/shard1 node

        We need to audit that. We should be using that node to ensure only one leader in any case.

        Mark Miller added a comment -

        We really want to remove the following and count on SOLR-5799. This is really a mistake. This patch probably still makes sense as well, though.

          // first we delete the node advertising the old leader in case the ephem is still there
          try {
            zkClient.delete(context.leaderPath, -1, true);
          } catch (KeeperException.NoNodeException nne) {
            // no problem
          } catch (InterruptedException e) {
            throw e;
          } catch (Exception e) {
            // failed to delete the leader node
            log.error("leader elect delete error", e);
            retryElection(context, false);
            return;
            // fine
          }
        Mark Miller added a comment -

        Here is a patch merging everything together for trunk.

        We no longer delete the advertised leader node - too dangerous. We also do better at stopping elections from living across session expiration reconnects.

        Jessica Cheng Mallet added a comment -

        Mark Miller, not sure if this is intended--looks like the newly added ShardLeaderElectionContextBase.cancelElection now blindly deletes the leader node, which sounds just as dangerous. From your comment it seems like you just wanted it to expire out, so I'm wondering if it's just a merge bug or something.

        In general, I think it'd make a lot of sense to predicate the writing of the leader node on the election node still having the same session as the thread thinks (using the same zookeeper multi-transactional semantics as in ZkController.markShardAsDownIfLeader), so that a thread that went GCing before writing the leader node will fail when it comes back since its election node will have expired.
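The multi-transaction idea above can be sketched as: check that our ephemeral election node still exists under our session, and write the leader node in the same atomic batch. The following is a toy in-memory model of that semantics (hypothetical class and method names, not the real ZooKeeper client API, which would use Op.check/Op.create in a multi):

```java
import java.util.HashMap;
import java.util.Map;

public class LeaderWriteGuard {
    // Toy znode store: path -> owning session id (ephemeral nodes only).
    private final Map<String, Long> ephemerals = new HashMap<>();

    public void createEphemeral(String path, long sessionId) {
        ephemerals.put(path, sessionId);
    }

    // Session expiry removes all ephemerals owned by that session.
    public void expireSession(long sessionId) {
        ephemerals.values().removeIf(owner -> owner == sessionId);
    }

    // Atomic "multi": only write the leader node if our election node is
    // still present under our session. A thread that paused (e.g. in GC)
    // across a session expiry fails here instead of becoming a second leader.
    public boolean tryBecomeLeader(String electionPath, long sessionId,
                                   String leaderPath) {
        Long owner = ephemerals.get(electionPath);
        if (owner == null || owner != sessionId) {
            return false; // our registration is gone: abort
        }
        ephemerals.put(leaderPath, sessionId);
        return true;
    }
}
```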

        Mark Miller added a comment - - edited

        Mark Miller, not sure if this is intended--looks like the newly added ShardLeaderElectionContextBase.cancelElection now blindly deletes the leader node, which sounds just as dangerous. From your comment it seems like you just wanted it to expire out, so I'm wondering if it's just a merge bug or something.

        No, it seems we need it because sometimes the leader is lost without an expiration. Some tests won't pass without it. I think this is why that first delete was put in - a quick fix.

        We do want to make sure we only remove our own registration though. We should be able to track that with the znode version and ensure we don't remove another candidate's entry with an optimistic delete?

        In general, I think it'd make a lot of sense to predicate the writing of the leader node on the election node still having the same session as the thread thinks (using the same zookeeper multi-transactional semantics as in ZkController.markShardAsDownIfLeader), so that a thread that went GCing before writing the leader node will fail when it comes back since its election node will have expired.

        That sounds like a good idea as well.
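The optimistic delete keyed on a znode version discussed above corresponds to ZooKeeper's delete(path, expectedVersion): the delete succeeds only if the node is still at the version we recorded, so we never remove another candidate's entry. A toy in-memory model of that check (hypothetical names; the real code would go through SolrZkClient):

```java
import java.util.HashMap;
import java.util.Map;

public class VersionedStore {
    private static final class Node {
        int version; // bumped on every setData
    }
    private final Map<String, Node> nodes = new HashMap<>();

    // Newly created znodes start at version 0.
    public int create(String path) {
        nodes.put(path, new Node());
        return 0;
    }

    // A data change bumps the version, invalidating stale delete attempts.
    public int setData(String path) {
        return ++nodes.get(path).version;
    }

    // Optimistic delete: only remove the node if it is still at the
    // version we recorded; -1 means unconditional, as in ZooKeeper.
    public boolean delete(String path, int expectedVersion) {
        Node n = nodes.get(path);
        if (n == null || (expectedVersion != -1 && n.version != expectedVersion)) {
            return false;
        }
        nodes.remove(path);
        return true;
    }
}
```

This also illustrates Jessica's later point that "create" implies version 0, so the creator knows the version without reading it back.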

        Jessica Cheng Mallet added a comment -

        We do want to make sure we only remove our own registration though. We should be able to track that with the znode version and ensure we don't remove another candidate's entry with an optimistic delete?

        Sounds good!

        Thanks for clarifying!

        Mark Miller added a comment -

        using the same zookeeper multi-transactional semantics as in ZkController.markShardAsDownIfLeader)

        Hmm ... is there an easy way to check session id this way though? Perhaps we should open up a new issue to try for this improvement.

        Mark Miller added a comment -

        Another patch that only deletes a registered leader node if the current election context owns it.

        Mark Miller added a comment -

        I'm also sad to not see a way to get the version back from a created node? Just the path?

        Jessica Cheng Mallet added a comment -

        Ah, well, "create" means version is 0 (or whatever initial version is) right? Otherwise you get a NodeExists back. Hmm...

        Mark Miller added a comment -

        I did not know you could count on that, but good point, I'll read up on versions again and work that in.

        Mark Miller added a comment -

        Okay, here is a perhaps more bulletproof idea.

        We use the multi to create the leader znode and get the cversion of the parent node of the leader znode.

        We then conditionally delete on cancel with multi again, conditional on the parent node's cversion.

        Rough patch attached.

        Mark Miller added a comment -

        No, they don't let you compare the cversion in the part where we use the multi to delete - darn.

        Jessica Cheng Mallet added a comment -

        Since you're doing a setData on the parent (and thereby bumping the parent's version) each time you create the leaderPath, you should be able to rely on the parent's version as well, instead of its cversion.

        Since you're doing the multi already, might as well add "ops.add(Op.check(leaderSeqPath, -1));" right before the Op.create?
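The scheme the thread converges on can be modeled end to end: registering bumps the parent's dataVersion via setData and creates the leader child in one atomic batch; cancelling is another batch that checks the parent is still at the version we saw before deleting the child. A toy model of those two batches (hypothetical names, not the ZooKeeper client API):

```java
import java.util.HashMap;
import java.util.Map;

public class ParentVersionGuard {
    private final Map<String, Integer> parentVersions = new HashMap<>(); // path -> dataVersion
    private final Map<String, Boolean> leaders = new HashMap<>();        // leaderPath -> exists

    // Registration "multi": setData on the parent (bumping its version)
    // plus create of the leader child, atomically. Returns the version
    // this registrant must present later in order to cancel.
    public int register(String parentPath, String leaderPath) {
        int v = parentVersions.merge(parentPath, 1, Integer::sum);
        leaders.put(leaderPath, true);
        return v;
    }

    // Cancel "multi": check(parent, versionSeen) + delete(leader). If a
    // newer registration has bumped the parent since we registered, the
    // check fails and we leave the new leader's node alone.
    public boolean cancel(String parentPath, String leaderPath, int versionSeen) {
        if (parentVersions.getOrDefault(parentPath, 0) != versionSeen) {
            return false;
        }
        leaders.remove(leaderPath);
        return true;
    }
}
```

Because only a leader registration performs the setData, the parent's dataVersion (unlike its cversion) changes exactly once per registration, which is why it can stand in for ownership of the leader child.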

        Mark Miller added a comment -

        Yeah, I tried that trick after some googling, but the parent version and cversion seem to move out of sync. I have not spotted why yet, but it seems too tricky to enforce or test. Running low on options, though.

        Jessica Cheng Mallet added a comment -

        The parent's cversion will change if the child node expires, whereas in this case the version won't change--so they won't be in sync. But that's ok. Are you seeing any case where while the leader node stayed there (didn't change) that the parent version changed?

        Mark Miller added a comment -

        But that's ok.

        Okay, so I had a slightly different thought in my head. Now I understand what you meant. That works I think.

        I just have to figure out why the leader rebalance test is failing, but otherwise I think this works.

        Mark Miller added a comment -

        Man, this leader rebalance stuff looks really hairy. Not much appetite to dig into it. I'm already spotting likely bugs and odd things.

        Mark Miller added a comment -

        Blah. Perhaps it's working now.

        Rough patch attached.

        Mark Miller added a comment -

        Still some issue with rebalance leaders.

        Mark Miller added a comment -

        Okay, still in rough edges shape, but this patch should address everything.

        We do have to move the advertised leader node in ZK so that we can have a shard-specific, non-ephemeral version to track.

        Perhaps in 5x we can also just still write that data to the now parent node?

        Mark Miller added a comment -

        So, machine1 won the race with machine2 and declared its leadership and created the nodes. However, machine3 has just deleted them, and recreated them for itself. So machine1 and machine3 both believe they are the leader.

        This is an interesting state. Mike Roberts, do you have any notes on the following behavior in this state? Other nodes should consider the leader whoever is advertised in ZK as that is what lookups will return. Even the leader that did not win the advertisement should consider the winning node the real leader, and in cases updates came in anyway, our defensive checks should prevent the losing leader from accepting anything?

        In any case, the latest patch should use ZK to ensure only one node can ever think it's the leader.

        Mark Miller added a comment -

        Here is a cleaned up patch with all tests passing.

        Mark Miller added a comment -

        Let me know what you think Jessica Cheng Mallet. I think this is pretty close.

        Jessica Cheng Mallet added a comment -

        This looks good to me.

        The two comments I have are the following:
        1. It'd be good to add some "explanation/documentation" type comment in ElectionContext to describe what we're trying to accomplish with the zk multi-transaction. For example, why can we rely on the parent version, etc.
        2. Will the change to the leaderProps in ZkController (adding CORE_NODE_NAME_PROP) make this change non-backwards compatible?

        Related but unrelated--how do you guys usually make line comments in a JIRA patch situation? Here I only have two comments so it's pretty tractable, but I can see it being difficult if it's a large change, etc.

        Thanks!

        Mark Miller added a comment -

        2. Will the change to the leaderProps in ZkController (adding CORE_NODE_NAME_PROP) make this change non-backwards compatible?

        I think it's kind of an internal API, but we could accept both for 5x.

        how do you guys usually make line comments in a JIRA patch situation?

        I signed us up for ReviewBoard sometime back.

        It has the downside of splitting the discussion between JIRA and ReviewBoard and never really got much traction.

        I tried to jump start it anyway because of two key features:

        1. Line comments
        2. Ability to easily see diffs between patches

        I'd be happy to put any patch on it by request.

        Shawn Heisey added a comment -

        Being able to see diffs between patches would be pretty awesome. I have on occasion used diff on patches in bash, but without color syntax highlighting, it's hard to read. Something that would be really cool for that particular use case is multiple colors, not just two – one color set for the additions and removals in the underlying patch, one for the differences between the patches. I'm thinking about another color set for parts of the text where the two overlap, but until I actually see it, I don't know if that would truly be useful.

        I saw things on the mailing list about ReviewBoard, but it didn't jump out as directly useful to me. I will pay more attention the next time something shows up from it.

        Mark Miller added a comment -

        Okay, I think we are ready to commit this.

        For 5x I will accept NODE_NAME as well as CORE_NODE_NAME for the rebalance API and write the current leader info to the old registration node as well.

        ASF subversion and git services added a comment -

        Commit 1700603 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1700603 ]

        SOLR-7844: Zookeeper session expiry during shard leader election can cause multiple leaders.

        Mark Miller added a comment -

        Here is the 5x patch.

        ASF subversion and git services added a comment -

        Commit 1700853 from Mark Miller in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1700853 ]

        SOLR-7844: Zookeeper session expiry during shard leader election can cause multiple leaders.

        Mark Miller added a comment -

        Thanks guys!

        Shai Erera added a comment -

        Mark Miller this seems to break upgrading existing 5x (e.g. 5.3) clusters to 5.4, unless I missed a "migration" step. If you're doing a rolling upgrade, such that you take one of the nodes down, replace the JARs with 5.4 and restart the node, you'll see exceptions such as:

        org.apache.solr.common.SolrException: Error getting leader from zk for shard shard1
            at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1034)
            at org.apache.solr.cloud.ZkController.register(ZkController.java:940)
            at org.apache.solr.cloud.ZkController.register(ZkController.java:883)
            at org.apache.solr.core.ZkContainer$2.run(ZkContainer.java:184)
            at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:213)
            at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:696)
            at org.apache.solr.core.CoreContainer.create(CoreContainer.java:750)
            at org.apache.solr.core.CoreContainer.create(CoreContainer.java:716)
            at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:623)
            at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:204)
            at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:184)
            at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
            at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:664)
            at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:438)
            at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:222)
            at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:181)
                ...
        Caused by: org.apache.solr.common.SolrException: Could not get leader props
            at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1081)
            at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1045)
            at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1001)
            ... 35 more
        Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/acg-test-1/leaders/shard1/leader
            at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
            at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
            at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
            at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
            at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
            at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
            at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
            at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1059)
        

        When the 5.4 nodes come up, they don't find the /collections/coll/leaders/shard1/leader path and fail. I am not quite sure how to recover from this though, since the cluster has a mixture of 5.3 and 5.4 nodes. I cannot create .../shard1/leader since ../shard1 is an EPHEMERAL node and therefore can't have child nodes. I am not sure what will happen if I delete "../shard1" and recreate it as non-EPHEMERAL; will the old 5.3 nodes work? I also need to ensure that the new 5.4 node doesn't become the leader if it wasn't already.

        Perhaps a fix would be for 5.4 to fallback to read the leader info from "../shard1"? Then when the last 5.3 node is down, the leader will be attempted by a 5.4 node which will recreate the leader path according to the 5.4 format? Should this have been a zk version change?

        I'd appreciate some guidance here.

        Shai Erera added a comment -

        I think that if we also check for the leader under "/shard1" as a fallback, the situation will resolve itself? I.e. the new 5.4 nodes will come up finding a leader. When that leader dies, the "shard1" EPHEMERAL node will be deleted, and a 5.4 node will create the proper structure in ZK? What do you think?
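The two ZK locations under discussion could be sketched roughly as below. This is a hypothetical helper, not actual Solr code; the path shape is taken from the stack trace above (/collections/acg-test-1/leaders/shard1/leader), and the lookup order follows the fallback proposed in this comment.

```java
// Hypothetical sketch of the old and new leader locations discussed above.
// Not actual Solr code; path shape taken from the stack trace in this issue.
public class LeaderPaths {
    // 5.4+ stores leader props on a child node under the shard node.
    static String newLeaderPath(String collection, String shard) {
        return "/collections/" + collection + "/leaders/" + shard + "/leader";
    }

    // Pre-5.4 stored leader props directly on the ephemeral shard node.
    static String legacyLeaderPath(String collection, String shard) {
        return "/collections/" + collection + "/leaders/" + shard;
    }

    // The fallback proposed above: try the new location first, then the old.
    static String[] lookupOrder(String collection, String shard) {
        return new String[] {
            newLeaderPath(collection, shard),
            legacyLeaderPath(collection, shard)
        };
    }
}
```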

        Mark Miller added a comment -

        Yeah, 5x needs a little bridge back compat that checks the old location if the new one does not exist.

        We don't have any tests at all for rolling updates, so it is all a bit hit or miss, but I believe that should work out fine.

        Shai Erera added a comment -

        Where is the best place to check that in your opinion? If we do something in ZkStateReader.getLeaderPath(), then we cover both ZkController.getLeaderProps() and also ShardLeaderElectionContextBase. As I see it, the latter may also log failures to remove a leader node, while attempting to delete "shard1/leader".

        So in ZkStateReader.getLeaderPath(), we can perhaps check whether "shard1" is an ephemeral node (looking at its ephemeralOwner value – 0 means it's not ephemeral); if it is ephemeral, this is a pre-5.4 leader and we return it as the leader path. Otherwise, we return shard1/leader?

        Am I over-thinking it, and this only needs to be handled in ZkController.getLeaderProps(), catching NoNodeException and attempting to read the props from the parent? I ask because I'm not sure what role ShardLeaderElectionContextBase plays here.
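The ephemeralOwner check suggested above can be sketched as a pure decision. This is a hypothetical helper, not actual Solr code; in real code the ephemeralOwner value would come from the Stat that ZooKeeper's exists/getData returns.

```java
// Hypothetical sketch of the ephemeralOwner-based check suggested above.
// In ZooKeeper, a node's ephemeralOwner is the id of the session that owns
// it, or 0 if the node is persistent.
public class LegacyLeaderCheck {
    // Pre-5.4 leaders were the ephemeral shard node itself, so a non-zero
    // ephemeralOwner on "shard1" indicates a pre-5.4 leader.
    static boolean isPre54Leader(long ephemeralOwner) {
        return ephemeralOwner != 0;
    }

    // Pick the path to read leader props from, given the shard node's path
    // and its ephemeralOwner value.
    static String leaderPropsPath(String shardPath, long ephemeralOwner) {
        return isPre54Leader(ephemeralOwner) ? shardPath : shardPath + "/leader";
    }
}
```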

        Mark Miller added a comment -

        Am I over-thinking it, and this only needs to be handled in ZkController.getLeaderProps(), catching NoNodeException and attempting to read the props from the parent?

        I think so.

        I ask because I'm not sure what role ShardLeaderElectionContextBase plays here.

        This sets where it will get written out. We don't need to change that, we want to write out to the new spot, so I think the above is probably all that needs to be done with an initial glance over.

        Shai Erera added a comment -

        Reopening to address "upgrade from 5.3" issue.

        Shai Erera added a comment -

        Mark Miller, can you please review this patch? If it looks OK to you, I'd like to try to get it out in 5.4.1.

        Mark Miller added a comment -

        I can look in the morning. We should use the other issue and just link to this one. This issue is released and essentially frozen now.

        Shai Erera added a comment -

        OK I will open a separate issue.

        Shai Erera added a comment -

        Opened SOLR-8561


          People

          • Assignee:
            Mark Miller
          • Reporter:
            Mike Roberts