SOLR-7065

Let a replica become the leader regardless of its last published state if all replicas participate in the election process.

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Attachments

    1. SOLR-7065.patch (38 kB) Mark Miller
    2. SOLR-7065.patch (27 kB) Mark Miller

      Issue Links

        Activity

        markrmiller@gmail.com Mark Miller added a comment -

        First whack at a draft patch.

        markrmiller@gmail.com Mark Miller added a comment -

        This is working pretty well, but the new testing leads to some shard inconsistency fails I have to track down.

        markrmiller@gmail.com Mark Miller added a comment -

        My latest state.

        markrmiller@gmail.com Mark Miller added a comment -

        I never really liked that the leader would try to publish the down state for replicas. In this tougher test, you can run into leaders that have a DOWN state and think they are an active leader. I've taken that out in my latest work. I think there are races there that are hard to reason about, and we should usually avoid having a node publish state for another node.

        markrmiller@gmail.com Mark Miller added a comment -

        Ugh... the leader election code was hard enough to grasp before all of this out-of-order stuff.

        Love to see the following log line for a shard:

        log.info("was going to be leader {} , seq(0) {}", context.leaderSeqPath, holdElectionPath + "/" + seqs.get(0));//but someone else jumped the line

        And then of course no new leader...where are you line jumper?

        markrmiller@gmail.com Mark Miller added a comment -

        This ZK leader election recipe has spiraled out of control. It was super delicate to begin with; now it's a monster that can't be trusted or barely reasoned out fully. I think someone assumed we have much more thorough testing for it than we do.

        markrmiller@gmail.com Mark Miller added a comment -

        Probably we should look at using a better algorithm for ordering - having enough replicas to cause a 'thundering herd' problem is not too realistic, so a less scalable but better-suited algorithm will be a lot better than trying to pile crazy on top of complicated.

        erickerickson Erick Erickson added a comment - edited

        Yeah, I started to take a whack at it at one point, basically taking control of the ordering of the election queue but abandoned it due to time constraints. One problem is that we're bastardizing the whole ephemeral election process in ZK and resorting to the "tie breaker" code that does things like "find the next guy and jump down two, unless you're within the first two of the head in which case do nothing". And the sorting is sensitive to the session ID to boot.

        The TestRebalanceLeaders code exercises the shard leader election; we can see if we can extend it. I'm not sure how robust it is when nodes are flaky.

        You mentioned at one point that you wondered whether the whole "watch the guy in front" approach and ZK's ephemeral-sequential nodes were the right way to approach this. The hack I started still used that mechanism, just taking better control of how nodes were inserted into the leader election queue, so I don't think that approach really addresses why this has spun out of control.

        I really wonder if we should change the mechanism. It seems to me that the fundamental fragility (apart from how hard the code is to understand) is that if the sequence of who watches which ephemeral node somehow gets out of whack, there is no mechanism for letting the other nodes in the queue know that there's a problem that needs to be sorted out, which can result in no leader at all, I assume. It has certainly happened often enough to me.

        I wonder if tying leader election into ZK state changes rather than watching the ephemeral election node-in-front is a better way?

        This has not been thought out, but what about something like the following (see the sketch after this list):

        Solr gets a notification of state change from ZK and drops into the "should I be leader" code which gets significantly less complex.
        -1> If I'm not active, ??? Probably just return assuming the next state change will re-trigger this code.
        0> If I'm not in the election queue, put myself at the tail. (handles mysterious out-of-whack situations)
        1> If there is a leader and it's active, return. (if it's in the middle of going down, we should get another state change when it's down, right?)
        2a> If some other node is both active and the preferred leader, return (again depending on a state change message to get back to this code if that node goes down).
        2b> If I'm the preferred leader, take over leadership.
        3> If any other node in the leader election queue in front of me is active, return (state change gets us back here if those nodes are going down).
        4> Take over leadership.
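
        A minimal sketch of the decision flow above, assuming hypothetical stand-in types; StateView and its methods are invented for illustration and are not Solr's real API:

        // Hypothetical sketch of the proposed "should I be leader" flow; every step
        // is a read of cluster state, so a missed transition is repaired by the
        // next state-change notification.
        interface StateView {
          boolean isActive(String node);                      // is this replica ACTIVE?
          String currentLeader();                             // null if no leader registered
          String preferredLeader();                           // null if none configured
          java.util.List<String> queueAheadOf(String node);   // election queue in front of me
          boolean inQueue(String node);
          void joinQueueAtTail(String node);
          void takeOverLeadership(String node);
        }

        class LeaderDecision {
          /** Invoked on every ZK cluster-state change notification. */
          static void onStateChange(StateView state, String me) {
            if (!state.isActive(me)) return;                       // -1> not active: wait for the next change
            if (!state.inQueue(me)) state.joinQueueAtTail(me);     // 0> re-join the queue at the tail
            String leader = state.currentLeader();
            if (leader != null && state.isActive(leader)) return;  // 1> healthy leader already exists
            String pref = state.preferredLeader();
            if (pref != null && state.isActive(pref) && !pref.equals(me)) return; // 2a> active preferred leader elsewhere
            if (me.equals(pref)) { state.takeOverLeadership(me); return; }        // 2b> I'm the preferred leader
            for (String ahead : state.queueAheadOf(me)) {
              if (state.isActive(ahead)) return;                   // 3> an active node is ahead of me
            }
            state.takeOverLeadership(me);                          // 4> nobody better: take over
          }
        }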

        Since this operates off of state changes to ZK, it seems like it gives us the chance to recover from weird situations. I don't think it increases traffic; don't all ZK state changes have to go to all nodes anyway?

        I'm not sure in this case whether we even need a leader election queue at all. Is the clusterstate any less robust than the election queue? Even if it would be just as good, I'm not sure how you'd express "the node in front". Actually, a simple counter property in the state for each replica might do it: you'd set it to one more than any other node in the collection when a node changed its state to "active". I'll freely admit, though, that you've seen a lot more in the weeds here than I have, so I'll defer to your experience.

        Anyway, let's kick the tires of what's to be done, maybe we can tag-team this. I consider the above just a jumping-off point to tame this beast. Be glad to chat if you or anyone else wants to kick it around...

        One thing I'm not real clear on is how up-to-date the ZK cluster state is. Since changing the state is done through the Overseer, how do we ensure that the state is current when making decisions?

        markrmiller@gmail.com Mark Miller added a comment -

        You mentioned at one point that you wondered whether the whole "watch the guy in front" and ZKs ephemeral-sequential node was the right way to approach this.

        Right. This entire approach is an elegant ZooKeeper recipe that is actually quite difficult to program perfectly. Its point is to prevent a thundering herd problem when you have tons of nodes involved in the election - with simpler approaches, if a leader goes down, everyone in the election can end up checking the same nodes about what has changed, and this can cause problems. Except that you never have more than a handful of replicas. Even 20 replicas is kind of crazy, and it's still not even close to a herd.

        This elegant solution is hard to nail, hard to test properly, and as can be seen, not very good for dealing with priorities and altering the election line.
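
        For reference, a self-contained sketch of the ephemeral-sequential "watch the node in front" recipe being discussed, using plain ZooKeeper client calls; the election path and the becomeLeader() hook are invented for the example, and this is not Solr's actual LeaderElector:

        import java.util.Collections;
        import java.util.List;
        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        class ElectionSketch {
          private final ZooKeeper zk;
          private final String electionRoot = "/collections/col1/leader_elect/shard1/election"; // example path
          private String myNode; // e.g. ".../n_0000000007"

          ElectionSketch(ZooKeeper zk) { this.zk = zk; }

          void joinElection() throws KeeperException, InterruptedException {
            // Each candidate registers an ephemeral, sequentially numbered node.
            myNode = zk.create(electionRoot + "/n_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            checkIfLeader();
          }

          private void checkIfLeader() throws KeeperException, InterruptedException {
            List<String> seqs = zk.getChildren(electionRoot, false);
            Collections.sort(seqs); // sequence order defines the election order
            String mySeq = myNode.substring(myNode.lastIndexOf('/') + 1);
            int myIndex = seqs.indexOf(mySeq);
            if (myIndex == 0) {
              becomeLeader();
              return;
            }
            // Watch only the node directly in front of me; when it goes away, re-check.
            String nodeInFront = electionRoot + "/" + seqs.get(myIndex - 1);
            if (zk.exists(nodeInFront, event -> {
                  try { checkIfLeader(); } catch (Exception e) { /* re-join / retry in real code */ }
                }) == null) {
              checkIfLeader(); // predecessor already gone, re-evaluate immediately
            }
          }

          private void becomeLeader() { /* publish leader state, sync replicas, etc. */ }
        }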

        A very simple solution that involves the Overseer, or optimistic locking / writing, would make reordering the election much, much simpler.

        andyetitmoves Ramkumar Aiyengar added a comment -

        not very good for dealing with priorities

        In particular, the follow-the-leader approach makes it very hard to do things like assign arbitrary priorities to nodes (the current jump-the-queue stuff works because there are only two priorities; extending it to more will make it untenable very soon). You could, on the other hand, come up with a solution to do that with optimistic locking.

        erickerickson Erick Erickson added a comment -

        Hmm, so in the optimistic locking case are you thinking of bypassing the Overseer completely for the shard leader election case? And maybe the overseer election case?

        markrmiller@gmail.com Mark Miller added a comment - edited

        Right, the optimistic election approach would not necessarily need to use the Overseer. It's similar to how each node would update the clusterstate itself (before the Overseer was added) - and if someone else changed it first, you retry.
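
        A rough illustration of that optimistic, compare-and-set style, using ZooKeeper's versioned setData as the guard; the leader znode path, payload format, and takeover policy below are invented for the example and are not existing Solr code:

        import java.nio.charset.StandardCharsets;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooKeeper;
        import org.apache.zookeeper.data.Stat;

        class OptimisticLeaderClaim {
          /**
           * Read the current leader entry, decide whether to take over, then write
           * back conditioned on the version we read. If another node wrote first,
           * the version check fails and we simply re-read and re-decide.
           */
          static boolean tryClaim(ZooKeeper zk, String leaderPath, String myCoreNode)
              throws KeeperException, InterruptedException {
            while (true) {
              Stat stat = new Stat();
              byte[] current = zk.getData(leaderPath, false, stat);
              String currentLeader = new String(current, StandardCharsets.UTF_8);
              if (!shouldTakeOverFrom(currentLeader, myCoreNode)) {
                return false; // someone else holds (or should hold) the leadership
              }
              try {
                zk.setData(leaderPath, myCoreNode.getBytes(StandardCharsets.UTF_8),
                    stat.getVersion()); // succeeds only if nobody wrote since our read
                return true;
              } catch (KeeperException.BadVersionException e) {
                // Lost the race: another node updated the znode first. Loop and retry.
              }
            }
          }

          private static boolean shouldTakeOverFrom(String currentLeader, String me) {
            // Placeholder policy: a real version would consult live-node, priority,
            // and last-published-state information rather than just "entry is empty".
            return currentLeader.isEmpty();
          }
        }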

        markrmiller@gmail.com Mark Miller added a comment -

        This is an important but tough issue to finish. Hope to get back to it after my hand heals up.

        markrmiller@gmail.com Mark Miller added a comment -

        I think some of the problems I hit while working on this were due to bugs - some of which have been fixed by now. I'm going to see if I can get this patch up to trunk soon.

        mdrob Mike Drob added a comment -

        Recently saw something that might be this. Started trying to bring your patch up to current master, but ran into issues; some of the changes that you had in this patch got committed as part of SOLR-7033. I also didn't understand the advantage of returning an int instead of a boolean for sync. It looks like you used it to provide a ternary indicator of error, no sync necessary, or sync completed? That code changed a bunch with the fingerprinting from SOLR-8586.

        A specific case that doesn't make sense was:

        SyncStrategy.java
             if (SKIP_AUTO_RECOVERY) {
        -      return true;
        +      return -1;
             }
        

        Should this be return 0?

        I see a lot of design discussion in this JIRA prior, but not a lot of consensus. What do you think is the easiest way forward from here, Mark Miller?

        markrmiller@gmail.com Mark Miller added a comment -

        We are skipping recovery, so we want to return -1 (success).

        I think this is a tricky issue. Requires a bit of thought to make sure it's all okay. But I think I roughly had what we need in the patch. The main issue was that the test exposed some kind of problem where no leader would be elected. I think this may now be okay since another issue has been resolved. Most of the discussion above is not very related to this patch.

        mdrob Mike Drob added a comment -

        We are skipping recovery, so we want to return -1 (success).

        That's... not what I expected. Can you explain what the possible return values mean? I got the impression that the three options are -1, 0, and > 0?

        markrmiller@gmail.com Mark Miller added a comment -

        That wasn't a finished patch, just work on getting this to work, so don't insist on sticking to any of the impl.

        But, if I remember right (and I may not), -1 means success (synced with all replicas), and > 0 is how many replicas were synced with if it wasn't all of them. In that case, 0 would mean it did not sync with all of them - it actually synced with none of them.
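
        Read as code, the convention Mark recalls (a hypothetical helper, since the patch was never finalized) would look something like:

        class SyncResultConvention {
          /** -1 = synced with every replica; otherwise the count of replicas that did sync (0 = none). */
          static boolean syncedWithAll(int syncResult) {
            return syncResult == -1;
          }

          static String describe(int syncResult, int totalReplicas) {
            if (syncResult == -1) {
              return "success: synced with all " + totalReplicas + " replicas";
            }
            return "partial: synced with " + syncResult + " of " + totalReplicas + " replicas";
          }
        }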

        mdrob Mike Drob added a comment -

        I had an updated version of the tests that was failing correctly on master, but it looks like it started to pass after SOLR-7280.

        Occasionally I will see the test stall out, but it doesn't actually fail, just never completes. And it's not consistent between runs using the same seed. I still think we're in a better place now than we were before, though.


          People

          • Assignee: Mark Miller (markrmiller@gmail.com)
          • Reporter: Mark Miller (markrmiller@gmail.com)
          • Votes: 1
          • Watchers: 16
