Details

    • Flags:
      Patch, Important

      Description

      This patch is still somewhat WIP for a couple of reasons:

      1) Still debugging test failures.
      2) This will need more scrutiny from knowledgeable folks!

      There are some subtle bugs with the current implementation of LeaderElector, best demonstrated by the following test:

      1) Start up a small single-node solrcloud. It should become Overseer.
      2) kill -9 the solrcloud process and immediately start a new one.
      3) The new process won't become overseer. The old process's ZK leader elect node has not yet disappeared, and the new process fails to set appropriate watches.

      NOTE: this is only reproducible if the new node is able to start up and join the election quickly.
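The failure mode above follows from the standard ZooKeeper leader-election recipe that LeaderElector implements: each candidate creates an ephemeral sequential node and the lowest sequence number is leader. A minimal sketch of that ordering logic (a toy in-memory stand-in for the election znode's children — the names and data structures are illustrative, not Solr's actual API):

```java
import java.util.TreeMap;

public class ElectionSketch {
    // Stand-in for the election znode's children, ordered by the
    // ZK-assigned sequence number.
    static TreeMap<Integer, String> electionNodes = new TreeMap<>();

    static int join(String sessionId, int seq) {
        electionNodes.put(seq, sessionId);
        return seq;
    }

    // The candidate holding the lowest sequence number is the leader.
    static boolean amLeader(int mySeq) {
        return electionNodes.firstKey() == mySeq;
    }

    public static void main(String[] args) {
        // Node left by the kill -9'd process: its ephemeral node lingers
        // until the old ZK session times out.
        join("old-session", 0);
        // The new process starts quickly and joins behind it.
        int mine = join("new-session", 1);
        // Not leader: it must set a watch on seq 0 and wait for the old
        // session to expire. If that watch is set incorrectly, the new
        // process never notices the deletion and never becomes Overseer.
        System.out.println(amLeader(mine));
    }
}
```

The bug described above lives in the "set a watch on the node ahead of you" step: if that watch is never established, the stale node's eventual expiry goes unnoticed.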

      1. OverseerTestFail.log
        625 kB
        Mark Miller
      2. SOLR-8697.patch
        9 kB
        Scott Blum
      3. SOLR-8697-followup.patch
        9 kB
        Scott Blum

        Issue Links

          Activity

          dragonsinth Scott Blum added a comment - Shalin Shekhar Mangar Erick Erickson Mark Miller
          markrmiller@gmail.com Mark Miller added a comment -

          The old process's ZK leader elect node has not yet disappeared, and the new process fails to set appropriate watches.

          I thought we waited for a while for the old one to go away...is that missing in the Overseer election code and just in the shard election code?

          dragonsinth Scott Blum added a comment -

          TBH, the code is pretty hard to follow in its existing form, and a lot of the theory of operation and design is difficult to distill. There is code that looks like it's trying to do what you're talking about, but it's buggy.

          Take this analysis with like 70% confidence: The existing code has a very hazy definition of "myself" and "previous registration", from what I can tell. The previous elect node from a different process qualifies as "myself" in some cases, so the new process thinks it's already registered when it's not.

          dragonsinth Scott Blum added a comment -

          Mark Miller Erick Erickson

          I think there is a potential problem with how OverseerTest is constructed, that perhaps caused us to write some code into LeaderElector in the past that doesn't make any sense for live code.

          I'm looking at the implementation of MockZkController.publishState() (it's kind of a beast) and I notice that when it creates an ElectionContext, it never actually adds it to the map, checks whether one already exists, etc. As a result, MockZkController does something the real ZkController never does – it tries to register two different election contexts for the same core on the same ZK session.

          My question is, what's the right fix? I can either make MockZkController not set up a new electionContext on subsequent invocations, or I could make it simply cancel the previous election context before creating a new one.

          dragonsinth Scott Blum added a comment -

          Found another bug in ShardLeaderElectionContextBase.

          cancelElection() and runLeaderProcess() can race with each other. If the local process is trying to cancel right as it becomes leader, cancelElection() won't see a leaderZkNodeParentVersion yet, so it won't try to delete the leader registration. Meanwhile, runLeaderProcess() still succeeds in creating the leader registration. The call to super.cancelElection() does remove us from the queue, but the dead leader registration is left there.

          I think moving the call to super.cancelElection() to the start of ShardLeaderElectionContextBase.cancelElection() should resolve the race, because actually removing the election node will cause the multi to fail with a NoNode rather than a NodeExists, and it won't get stuck in a retry loop.
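The proposed reordering can be sketched with a toy in-memory store standing in for ZooKeeper (method names mirror the discussion; this is not Solr's actual code). Cancelling removes the election node first, so a concurrent leader-registration multi that depends on that node fails cleanly instead of leaving a dead leader registration:

```java
import java.util.concurrent.ConcurrentHashMap;

public class CancelRaceSketch {
    static ConcurrentHashMap<String, String> zk = new ConcurrentHashMap<>();

    // Proposed ordering: remove our election node FIRST (the
    // super.cancelElection() step), then best-effort leader cleanup.
    static void cancelElection(String electionNode) {
        zk.remove(electionNode);
        zk.remove("/collection/leader");
    }

    // Toy multi: creating the leader registration requires our election
    // node to still exist, mimicking the multi failing with NoNode when
    // cancellation has already removed it.
    static boolean runLeaderProcess(String electionNode) {
        if (!zk.containsKey(electionNode)) {
            return false; // NoNode: we were cancelled, don't become leader
        }
        zk.put("/collection/leader", electionNode);
        return true;
    }

    public static void main(String[] args) {
        zk.put("/election/n_0", "me");
        cancelElection("/election/n_0");
        // With cancellation first, the leader multi fails cleanly rather
        // than registering a dead leader.
        System.out.println(runLeaderProcess("/election/n_0"));
        System.out.println(zk.containsKey("/collection/leader"));
    }
}
```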

          dragonsinth Scott Blum added a comment -

          Okay, all the tests are passing now, I think this is ready for review.

          markrmiller@gmail.com Mark Miller added a comment -

          TBH, the code is pretty hard to follow in its existing form

          Yup. It was mildly hairy in its first form (copying the ZK recipe as described) and took a while to harden. Then some contributions came along that made it insane to follow. I've brought it up before: instead of trying to avoid thundering-herd issues with what will be a reasonably low number of replicas trying to be leader, we probably should just have very simple leader elections. All of the original logic, and the logic that was added that made it really hard for me to follow, would be really simple if we gave up the cool, elegant approach we used to avoid a mostly non-existent thundering-herd issue. That thicket is just a ripe breeding ground for random bugs our tests don't easily expose.

          At this point, the effort to change reliably is probably really high though.

          dragonsinth Scott Blum added a comment - - edited

          I think part of the general problem with a lot of the ZK-interacting code is a lack of clean separation of concerns. The relationships between LeaderElector and the various ElectionContext subclasses are pretty gnarly and incestuous. DistributedQueue had a similar kind of design problem before I extracted the app specific gnarly parts into OverseerTaskQueue.

          Have we considered trying to migrate to, say, Apache Curator (full disclosure: I'm a committer)? There are a lot of advantages to using third party libs for some of these common patterns like distributed queue, leader election, or even observing changes in a tree. Those components tend to be reusable, better documented, with cleaner APIs, and have a natural resistance to spaghetti invasion. (Examples: OverseerNodePrioritizer and RebalanceLeaders are intricately tied to implementation details of LeaderElector.)

          A clean, reusable leader election component (with its own tests) that could simply be used in a few different contexts seems like a good place to be longer term.

          That said, I hope this patch can simply clean up some of the existing bugs without being too disruptive.

          markrmiller@gmail.com Mark Miller added a comment - - edited

          Curator has come up before. Personally, I have not wanted to try and mimic what we have or go through a protracted hardening process again. This stuff is all very touchy, and our tests def do not catch everything, so a rip and replace at that low level would be both very difficult and sure to introduce a lot of issues.

          I think a lot of the problem is that devs like to favor just tossing crap on top of what exists, rather than trying to holistically move the design forward or make it right for what they want to add (Examples: OverseerNodePrioritizer and RebalanceLeaders, which also made the election code much more dense). I feel a lot of "let's just make this work". I can't tell you how surprised I've been that some devs have come and built so much on some of the prototype code I initially laid out. I've always thought, how do you build so much on this without finding/fixing more core bugs and seeing other necessary improvements as you go? Not that it doesn't happen, but the scale has historically been way below what I think makes sense. Easy for me to say, I guess. Anyway, it's great that you have already filed a bunch of issues.

          I'd rather focus on some refactoring than bringing in curator though. The implications of that would be pretty large and we have plenty of other more pressing issues I think.

          markrmiller@gmail.com Mark Miller added a comment -

          full disclosure: I'm a committer

          And you worked on GWT! Awesome. Have not used it in years, still madly in love with GWT.

          using third party libs

          We might have started with Curator, but when we started (the query side of SolrCloud was 2009 or 2010?) it was not around, or we did not know about it back then if it was.

          dragonsinth Scott Blum added a comment -

          Yeah, totally agreed on refactoring and trying to fix core bugs! Bringing in Curator at some point would be something I'd only advocate for incrementally and in pieces, like replacing our DQ with Curator's, etc. Moving everything over in a short period of time would be a pipe dream anyway.

          Back on the topic of LeaderElector, I think this patch is in a pretty good state now. The only thing I want to consider doing in the short term (after this patch) is that, in addition to watching the node ahead of you, I think we should also be watching our own node, whether or not we're leader. If an outside party forcibly deletes our node, we should put ourselves at the back of the line. If you think about it, if we could trust that behavior, something like RebalanceLeaders wouldn't even need to be a distributed request; overseer could just delete the current leader elect node and trust the owner to do the right thing.
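The "watch your own node" idea can be simulated with the same toy ordered-map model (illustrative only — this is not Solr's code, and the watch callback is simulated by a direct method call): when an outside party deletes a candidate's election node, the owner notices and rejoins at the back of the line.

```java
import java.util.TreeMap;

public class SelfWatchSketch {
    static TreeMap<Integer, String> election = new TreeMap<>();
    static int nextSeq = 0;

    static int join(String owner) {
        election.put(nextSeq, owner);
        return nextSeq++;
    }

    // Simulated watch firing on our own node's deletion: rejoin at the
    // back of the line instead of silently dropping out of the election.
    static int onOwnNodeDeleted(String owner) {
        return join(owner);
    }

    public static void main(String[] args) {
        int leaderSeq = join("current-leader");
        join("replica-b");
        // Overseer forcibly deletes the current leader's elect node...
        election.remove(leaderSeq);
        // ...and the owner re-queues itself behind replica-b, which now
        // holds the lowest sequence number and becomes leader.
        int newSeq = onOwnNodeDeleted("current-leader");
        System.out.println(election.firstEntry().getValue());
        System.out.println(newSeq > leaderSeq);
    }
}
```

Under this model, RebalanceLeaders reduces to a single delete by the overseer, with each owner trusted to re-queue itself.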

          markrmiller@gmail.com Mark Miller added a comment -

          You can leave the old patches by the way. We tend to leave the history and just pull the latest patch with the same file name.

          markrmiller@gmail.com Mark Miller added a comment -

          Bringing in Curator at some point would be something I'd only advocate for incrementally and in pieces, like replace our DQ with Curator's, etc.

          Yeah, I suppose if we had some consensus to push it forward over time, that's a viable option.

          If an outside party forcibly deletes our node, we should put ourselves at the back of the line.

          Yeah, that sounds like an interesting improvement. Much nicer than making a bunch of distrib calls.

          markrmiller@gmail.com Mark Miller added a comment -

          FYI, we are in a special little place where we can break back compat and don't have to consider rolling upgrades because the next release is 6.0. We don't have much time before 6.0 branches though I think.

          Patch looks good to me. Others should take a look as well, but I'll commit to get Jenkins cranking on it.

          jira-bot ASF subversion and git services added a comment -

          Commit 9418369b46586818467109e482b70ba41e90d4ed in lucene-solr's branch refs/heads/master from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9418369 ]

          SOLR-8697: Scope ZK election nodes by session to prevent elections from interfering with each other and other small LeaderElector improvements.

          markrmiller@gmail.com Mark Miller added a comment -

          cancelElection() and runLeaderProcess() can race with each other. If the local process is trying to cancel right as it becomes leader, cancelElection() won't see a leaderZkNodeParentVersion yet, so it won't try to delete the leader registration. Meanwhile, runLeaderProcess() still succeeds in creating the leader registration. The call to super.cancelElection() does remove us from the queue, but the dead leader registration is left there.

          Any thoughts on why the existing stress tests for leader election can't catch this? Can we beef something up?

          dragonsinth Scott Blum added a comment -

          Actually OverseerTest.testShardLeaderChange() DID catch this race for me. But only rarely. Debugging that flake is how I uncovered the race.

          markrmiller@gmail.com Mark Miller added a comment -

          Okay, good, but then was it after your changes? I don't recall seeing that test fail in a long, long time on our Jenkins 'cluster', and that's a bunch of machines running the tests continuously. I also don't recall it locally.

          dragonsinth Scott Blum added a comment -

          I think it's such a subtle race that it would only generally show up with code changes, but as little as changing logging could trigger it or not trigger it. So it might have been beaten into a state where it happened to work unless you breathed on it. Drove me nuts for the longest time debugging it until I stumbled on the race.

          markrmiller@gmail.com Mark Miller added a comment -

          Note: Have not seen any fails like this in this test in a long, long time so may be related to this change. Perhaps just a test issue, because this retries for like 60 seconds or something.

             [junit4] ERROR   66.3s J4  | OverseerTest.testShardLeaderChange <<<
             [junit4]    > Throwable #1: org.apache.solr.common.SolrException: Could not register as the leader because creating the ephemeral registration node in ZooKeeper failed
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([C4618609907C7E14:1A3201FE8AE48BE5]:0)
             [junit4]    > 	at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:212)
             [junit4]    > 	at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:173)
             [junit4]    > 	at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:138)
             [junit4]    > 	at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:310)
             [junit4]    > 	at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:219)
             [junit4]    > 	at org.apache.solr.cloud.OverseerTest$MockZKController.publishState(OverseerTest.java:181)
             [junit4]    > 	at org.apache.solr.cloud.OverseerTest.testShardLeaderChange(OverseerTest.java:841)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
             [junit4]    > Caused by: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
             [junit4]    > 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
             [junit4]    > 	at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
             [junit4]    > 	at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
             [junit4]    > 	at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:577)
             [junit4]    > 	at org.apache.solr.common.cloud.SolrZkClient$11.execute(SolrZkClient.java:574)
             [junit4]    > 	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
             [junit4]    > 	at org.apache.solr.common.cloud.SolrZkClient.multi(SolrZkClient.java:574)
             [junit4]    > 	at org.apache.solr.cloud.ShardLeaderElectionContextBase$1.execute(ElectionContext.java:195)
             [junit4]    > 	at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:49)
             [junit4]    > 	at org.apache.solr.common.util.RetryUtil.retryOnThrowable(RetryUtil.java:42)
             [junit4]    > 	at org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:178)
             [junit4]    > 	... 45 more
          
          dragonsinth Scott Blum added a comment -

          I'll take a look. I ran into this before on account of the test code doing a couple of questionable things.

          1) Re-using the same ZK session for multiple election contexts on the same election. I fixed that one in my patch.

          2) Not deleting the registration on cancel. I didn't bother fixing this one, since I figured eventually the session timeout would kill all the ephemeral nodes, making for a more thorough test. But explicitly deleting the registration would speed the test up and make it more reliable, probably.

          Thoughts?

          markrmiller@gmail.com Mark Miller added a comment -

          Here is the log file just in case.

          markrmiller@gmail.com Mark Miller added a comment -

          deleting the registration on cancel.

          +1.

          dragonsinth Scott Blum added a comment -

          Attached a followup patch that cancels previous elections when the MockZkController is closed. I haven't actually been able to get OverseerTest.testShardLeaderChange() to flake on my machine, with or without the change, even running like 100 iterations. But

          markrmiller@gmail.com Mark Miller added a comment -

          , even running like 100 iterations.

          Unfortunately, that is common. The tests are run in a wicked diverse set of envs. We see stuff from the jenkins cluster no dev ever seems to hit.

          A lot of fails only pop when running with other tests that also bog down the system; you won't see the issue with that test in isolation. You can also configure different numbers of tests to run at the same time (I do 10 on my 6-core machine), and then there are the different levels of hardware...

          Some things are pretty hard to replicate locally.

          dragonsinth Scott Blum added a comment -

          Actually, I'm stupid. The flaky problem is that I still didn't fix the race regarding leaderZkNodeParentVersion. I just made it harder to repro.

          The smoking gun is this line:

                log.info("No version found for ephemeral leader parent node, won't remove previous leader registration.");
          

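          To make the race concrete, here is a minimal self-contained sketch (all names invented for illustration, not Solr's actual code) of why a missing parent-node version forces the log line above: the version recorded at registration time acts as a compare-and-swap precondition on the cleanup, and if cleanup runs before registration has recorded it, the cleanup is skipped.

```java
// Sketch with invented names: the recorded parent version gates the
// conditional removal of the previous leader registration.
import java.util.concurrent.atomic.AtomicInteger;

public class VersionedCleanup {
    static AtomicInteger parentVersion = new AtomicInteger(0);
    static Integer recordedVersion = null; // set when we register as leader
    static boolean leaderNodeExists = false;

    static void register() {
        leaderNodeExists = true;
        // In the real code this would be the setData op inside the ZK multi().
        recordedVersion = parentVersion.incrementAndGet();
    }

    static String cleanup() {
        if (recordedVersion == null) {
            // The race: cleanup ran before register() recorded the version.
            return "No version found for ephemeral leader parent node, won't remove previous leader registration.";
        }
        if (parentVersion.get() == recordedVersion) { // conditional delete
            leaderNodeExists = false;
            return "removed";
        }
        return "version changed, someone else is leader";
    }

    public static void main(String[] args) {
        System.out.println(cleanup()); // fires the smoking-gun log line
        register();
        System.out.println(cleanup()); // now the registration can be removed
    }
}
```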
          You can repro this pretty easily with the following change:

          --- a/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java
          +++ b/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java
          @@ -193,7 +193,8 @@ class ShardLeaderElectionContextBase extends ElectionContext {
                     List<OpResult> results;

                     results = zkClient.multi(ops, true);

          +          Thread.sleep(10000);
                     for (OpResult result : results) {
                       if (result.getType() == ZooDefs.OpCode.setData) {
                         SetDataResult dresult = (SetDataResult) result;

          We need stronger synchronization around becoming leader vs. canceling.
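          The shape of that synchronization can be sketched as follows (a minimal illustration with invented names, standing in for the ZK operations; not Solr's actual implementation): a single lock serializes "become leader" against "cancel election", so a cancel can never observe a half-completed registration and a registration can never slip in after a cancel.

```java
// Hypothetical sketch: one lock guards both leader registration and cancel,
// so neither can interleave with the other.
public class ElectionSketch {
    private final Object lock = new Object();   // guards both fields below
    private Integer leaderParentVersion = null; // recorded when we register
    private boolean cancelled = false;

    /** Register as leader, unless a cancel already happened. */
    public boolean becomeLeader(int parentVersion) {
        synchronized (lock) {
            if (cancelled) {
                return false; // too late: the election was cancelled first
            }
            // In Solr this would be the ZK multi() creating the leader node;
            // here we record the parent version atomically with the check.
            leaderParentVersion = parentVersion;
            return true;
        }
    }

    /** Cancel the election; removes our registration if we made one. */
    public boolean cancel() {
        synchronized (lock) {
            cancelled = true;
            if (leaderParentVersion == null) {
                // Without the lock, this is the "No version found" race.
                return false;
            }
            leaderParentVersion = null; // conditional delete in the real code
            return true;
        }
    }

    public static void main(String[] args) {
        ElectionSketch e = new ElectionSketch();
        if (!e.becomeLeader(3)) throw new AssertionError("should register");
        if (!e.cancel()) throw new AssertionError("should see registration");

        ElectionSketch f = new ElectionSketch();
        f.cancel(); // cancel wins the race this time
        if (f.becomeLeader(3)) throw new AssertionError("must not register after cancel");
        System.out.println("ok");
    }
}
```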

          dragonsinth Scott Blum added a comment -

          Attached a patch with real synchronization. 99% sure this will fix the observed flake.
          I also included the MockZKController.close() election cancel change; I think it's a good cleanup, though not related to the flake.

          dragonsinth Scott Blum added a comment -

          Mark Miller should I open a separate JIRA for the follow-on patch?

          markrmiller@gmail.com Mark Miller added a comment -

          No, we only open a new issue if this one has gone out in a release, otherwise we can keep iterating as needed in the same issue. A bit busy, but I'll get to this soon.

          jira-bot ASF subversion and git services added a comment -

          Commit efb7bb171b22a3c6a00d30eefe935a0024df0c71 in lucene-solr's branch refs/heads/master from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=efb7bb1 ]

          SOLR-8697: Add synchronization around registering as leader and canceling.

          jira-bot ASF subversion and git services added a comment -

          Commit c2bc93822d73b66c74ed998ea6c57a3cce05af44 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c2bc938 ]

          SOLR-8697: Fix precommit failure

          anshumg Anshum Gupta added a comment -

          Reopening to back port for 5.5.1

          jira-bot ASF subversion and git services added a comment -

          Commit 0dabbadb4dbdabbad4da06a49e2eebd908a4cd62 in lucene-solr's branch refs/heads/branch_5x from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0dabbad ]

          SOLR-8697: Scope ZK election nodes by session to prevent elections from interfering with each other and other small LeaderElector improvements.
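          The idea behind session-scoped election nodes can be illustrated with a small sketch (naming scheme invented for illustration; the actual node-name format is defined in the Solr/ZooKeeper code): embedding the ZK session id in each election node's name means a node left behind by a dead session can never be mistaken for one created by the current session, which is exactly the restart race from the issue description.

```java
// Hypothetical illustration of session-scoped election node names.
public class SessionScopedNodes {
    /** e.g. "beef-n_0000000002": session id (hex) plus sequence number. */
    static String electionNode(long sessionId, int seq) {
        return String.format("%x-n_%010d", sessionId, seq);
    }

    /** A process only treats a node as its own if the session id matches. */
    static boolean ownedBySession(String node, long sessionId) {
        return node.startsWith(String.format("%x-", sessionId));
    }

    public static void main(String[] args) {
        long oldSession = 0xdeadL, newSession = 0xbeefL;
        String stale = electionNode(oldSession, 1);
        // The restarted process can tell the stale node is not its own:
        if (ownedBySession(stale, newSession)) throw new AssertionError();
        if (!ownedBySession(electionNode(newSession, 2), newSession)) throw new AssertionError();
        System.out.println("ok");
    }
}
```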

          jira-bot ASF subversion and git services added a comment -

          Commit d08169e6f6355d9023779cac1303ab48fe2c86e0 in lucene-solr's branch refs/heads/branch_5x from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d08169e ]

          SOLR-8697: Add synchronization around registering as leader and canceling.

          jira-bot ASF subversion and git services added a comment -

          Commit 4b42c77ac6b045dce914d2244298cd0fea5300e4 in lucene-solr's branch refs/heads/branch_5x from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4b42c77 ]

          SOLR-8697: Fix precommit failure

          jira-bot ASF subversion and git services added a comment -

          Commit f8dd98f93b75ff774fd81c3a44d551d204676b94 in lucene-solr's branch refs/heads/branch_5_5 from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f8dd98f ]

          SOLR-8697: Scope ZK election nodes by session to prevent elections from interfering with each other and other small LeaderElector improvements.

          jira-bot ASF subversion and git services added a comment -

          Commit 78bed536984dbfd4ba2f802deb58b29979d59329 in lucene-solr's branch refs/heads/branch_5_5 from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=78bed53 ]

          SOLR-8697: Add synchronization around registering as leader and canceling.

          jira-bot ASF subversion and git services added a comment -

          Commit f6fca6901665cab4bea078baa4350ddc8964a2cf in lucene-solr's branch refs/heads/branch_5_5 from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f6fca69 ]

          SOLR-8697: Fix precommit failure

          jwartes Jeff Wartes added a comment -

          Does this fix SOLR-6498?

          dragonsinth Scott Blum added a comment -

          Probably so. I think we should close that one and assume it's fixed.


            People

            • Assignee:
              markrmiller@gmail.com Mark Miller
              Reporter:
              dragonsinth Scott Blum
            • Votes:
              0
              Watchers:
              6
