Solr / SOLR-5991

SolrCloud: Add API to move leader off a Solr instance

    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.7.1
    • Fix Version/s: None
    • Component/s: SolrCloud
    • Labels: None

      Description

      Common maintenance chores require restarting Solr instances.

      The process of a shutdown becomes a whole lot more reliable if we can proactively move any leadership roles off of the Solr instance we are going to shut down. The leadership election process then runs immediately.

      I am not sure what the semantics should be (either would accomplish the goal, but one of them might be best):

      • A call to tell a core to give up leadership (thus the next replica is chosen)
      • A call to specify which core should become the leader

        Issue Links

          Activity

Erick Erickson added a comment -

          Hmm, it seems like SOLR-7569 is another way of going about this?

Tomás Fernández Löbbe added a comment -

I did a very limited proof of concept for this (without the REBALANCELEADERS part) and will upload the patch soon. I won't be working on this for at least a couple of weeks.

Erick Erickson added a comment -

I'm not going to work on this in the foreseeable future, so Tomás Fernández Löbbe, if you want to pick it up, feel free. I think I originally took it as part of the REBALANCELEADERS work.

Erick Erickson added a comment -

          Right. My only point was that the scope of that work was limited.

          ADDROLE is, indeed, right here. Forgot about that.

          There is some infrastructure in place to force a leader onto the replica that has the preferredLeader property set, though it's not exposed at the collections API level. Perhaps add a parameter to the REBALANCELEADERS like shards=(comma separated list of shards) that would only force preferredLeader on the shards specified? That should map pretty easily into the current code. Whether we expose that to the end users or not is an open question, although I don't see any reason not to.
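For context, a minimal sketch of how the existing pieces mentioned above are driven today. The collection, shard, and replica names and the base URL are made up for illustration, and the proposed shards= parameter is not an existing REBALANCELEADERS option (it is only noted in a comment):

{code:python}
# Sketch of the existing preferredLeader machinery: mark the replica that
# should lead its shard, then ask Solr to rebalance leaders. Names and the
# base URL are illustrative assumptions.
import requests

COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"

# Set preferredLeader on the replica we would like to become leader.
requests.get(COLLECTIONS_API, params={
    "action": "ADDREPLICAPROP", "collection": "collection1", "shard": "shard1",
    "replica": "core_node2", "property": "preferredLeader",
    "property.value": "true"})

# Try to make preferredLeader replicas the actual leaders. The shards=...
# parameter suggested above is a proposal only and does not exist today.
requests.get(COLLECTIONS_API, params={
    "action": "REBALANCELEADERS", "collection": "collection1"})
{code}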

I think I like the AVOID_RESPONSIBILITY idea (DECOMMISSIONING is another possibility, maybe?) rather than specific roles, but that's not a strong preference. And you're doing the work, so you get to decide.

Here's an idea from left field: just force the node into the "down" state and keep it there while doing the bookkeeping, including the DELETEREPLICA stuff? You'd have to ensure that there was a live replica for each shard of each collection hosted on the node, but if so, I think much of the other work is essentially automatic, although I admit I don't know the details at this point. You'd also have to do the ADDREPLICA (step <4>) before shutting it down. You could force the preferredLeader to another node for the relevant replicas, and if we add a shard list to REBALANCELEADERS, force the leaders off the node before forcing it into the down state to control churn too.

          Minor nits
          <5> is unnecessary for any shard that has no replica with preferredLeader set on the node.

It's probably sufficient to just ADDREPLICAPROP to some other node, or BALANCESHARDUNIQUE for preferredLeader, and let it go at that. REBALANCELEADERS can be a fairly heavyweight process, and the preferredLeader property is just a suggestion anyway. If things are temporarily unbalanced in terms of actual leadership, that's probably fine. Short of someone issuing REBALANCELEADERS, I really don't expect every preferredLeader replica to actually be the leader anyway. What happens is that when a replica comes up, it inserts itself at the beginning of the queue for being elected leader, and it only really gets to be leader when the current leader goes down or REBALANCELEADERS is issued for that collection. At least that's how I remember it.
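A sketch of the lighter-weight alternative just described (collection name and URL are assumptions): BALANCESHARDUNIQUE spreads the preferredLeader property one-per-shard across nodes without forcing an immediate leader change.

{code:python}
# Lighter-weight option: only assign preferredLeader (one replica per shard,
# spread across nodes) and let actual leadership catch up later, e.g. on the
# next leader election or REBALANCELEADERS. Names are illustrative assumptions.
import requests

requests.get("http://localhost:8983/solr/admin/collections", params={
    "action": "BALANCESHARDUNIQUE", "collection": "collection1",
    "property": "preferredLeader"})
{code}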

Up to you, of course. In fact, it's not absolutely necessary to have any preferred leader either.

Tomás Fernández Löbbe added a comment - edited

bq. Right, the REBALANCE stuff was all about a condition where leaders for a collection were all landing on the same node...

Yes, I understand, and I'm not saying that that work is not needed. What I'm saying is that it may not be enough for the case of decommissioning a node.

bq. Assuming that there was a way to add node properties...

That is what the ADDROLE action in the Collections API is doing, right?

bq. once that was set wouldn't DELETEREPLICA for all the replicas on that node essentially decommission it?

Yes, but the idea is also to make sure there are no leaders. In my mind, the process of shutting down a server (e.g. node1) would be:

1. REMOVEROLE DATA for node1 -> Don't send any new replica for any existing or new collection to this node
2. REMOVEROLE OVERSEER for node1 -> Don't let this host be elected as overseer
3. for each replica in any collection in node1:
  ADDREPLICA -> Create new replicas on different nodes
4. REMOVEROLE LEADER for node1 -> Don't let any of the replicas on this node be shard leader
5. REBALANCELEADERS for any collection with at least one replica in node1
          6. DELETEREPLICA for all replicas in node1
          7. Shutdown node1

In this scenario I'm assuming that nodes are always started with all three of those roles; another option would be AVOID_RESPONSIBILITY and AVOID_DATA, or something like that. In the case of having to shut down many hosts at once, you could do steps 1 to 3 on all the nodes, then proceed with 4 to 7 (a rough sketch of these steps as Collections API calls is below).
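For reference, here is a minimal sketch of how the parts of this sequence that exist in today's Collections API could be scripted. The DATA and LEADER roles in steps 1 and 4 are proposals with no API yet, so they appear only as comments; REMOVEROLE does exist for the overseer role, although it only removes a prior designation rather than hard-preventing election. All node, collection, shard, and replica names are made up for illustration.

{code:python}
# Sketch of the decommission sequence above using only existing Collections
# API actions; the proposed DATA/LEADER roles (steps 1 and 4) are left as
# comments because no such roles exist yet. Names are illustrative only.
import requests

COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"

def decommission(node_to_remove, replicas, target_node, preferred_replicas):
    """replicas: (collection, shard, replica_name) triples hosted on node_to_remove.
    preferred_replicas: for each (collection, shard), a replica name NOT on node_to_remove."""
    # Step 1 (proposed, not implemented): REMOVEROLE DATA for node_to_remove.
    # Step 2: remove any overseer designation (only undoes a prior ADDROLE).
    requests.get(COLLECTIONS_API, params={
        "action": "REMOVEROLE", "role": "overseer", "node": node_to_remove})

    collections_touched = set()
    for collection, shard, replica in replicas:
        # Step 3: create a replacement replica on a surviving node.
        requests.get(COLLECTIONS_API, params={
            "action": "ADDREPLICA", "collection": collection,
            "shard": shard, "node": target_node})
        # Step 4 (proposed, not implemented): REMOVEROLE LEADER for node_to_remove.
        # Approximation with what exists: prefer a replica on another node.
        requests.get(COLLECTIONS_API, params={
            "action": "ADDREPLICAPROP", "collection": collection, "shard": shard,
            "replica": preferred_replicas[(collection, shard)],
            "property": "preferredLeader", "property.value": "true"})
        collections_touched.add(collection)

    # Step 5: rebalance leaders once per affected collection.
    for collection in collections_touched:
        requests.get(COLLECTIONS_API, params={
            "action": "REBALANCELEADERS", "collection": collection})

    # Step 6: drop the old replicas from the node being decommissioned.
    for collection, shard, replica in replicas:
        requests.get(COLLECTIONS_API, params={
            "action": "DELETEREPLICA", "collection": collection,
            "shard": shard, "replica": replica})
    # Step 7: shut down node_to_remove outside of this script.
{code}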

Erick Erickson added a comment -

          Right, the "superceded"JIRA (SOLR-6491) has the fixed revision in it (5.0).

          Right, the REBALANCE stuff was all about a condition where leaders for a collection were all landing on the same node. There were many shards each with lots of replicas, so when starting the cluster fresh, the first node up would have a lot of leaders on it, and the additional load was noticeable. That code also doesn't really do much cross-collection, just within a collection.

The current Solr code pretty much assumes that all nodes hosting a collection are similar; I think the larger discussion is heterogeneous environments where you have a mix of hardware capabilities. You can do this a little now by assigning collections and/or replicas to specific nodes, but that's a manual process.
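As a small illustration of the manual placement mentioned above (all names are made up): the createNodeSet parameter on CREATE and the node parameter on ADDREPLICA are the existing knobs for pinning collections or replicas to specific nodes.

{code:python}
# Existing manual placement knobs: createNodeSet restricts which nodes a new
# collection's replicas may land on, and ADDREPLICA's node parameter pins a
# single replica. Collection and node names are illustrative assumptions.
import requests

COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"

# Create a collection whose replicas may only be placed on the listed nodes.
requests.get(COLLECTIONS_API, params={
    "action": "CREATE", "name": "collection1", "numShards": 2,
    "replicationFactor": 1,
    "createNodeSet": "host1:8983_solr,host2:8983_solr"})

# Add one replica of shard1 on a specific node.
requests.get(COLLECTIONS_API, params={
    "action": "ADDREPLICA", "collection": "collection1",
    "shard": "shard1", "node": "host3:8983_solr"})
{code}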

Assuming that there was a way to add node properties (your AVOID_RESPONSIBILITY flag), once that was set, wouldn't DELETEREPLICA for all the replicas on that node essentially decommission it? I know we discussed the idea of node (as opposed to either collection or replica) properties, but I don't think that went anywhere, as it wasn't needed at the time. And any such property would have to be checked in the collection and replica creation code.

Tomás Fernández Löbbe added a comment -

I'd like to reopen this Jira; some of what was discussed here is not really resolved by the linked Jiras. The work done allows one to move the leader of a shard of a collection off of a node; however, for the purpose of shutting down hosts (either a single one or multiple), this is not enough. First, in order to achieve this for multiple collections, one would have to go through all the cores of a node and, for each one, find a replica in the cluster on a node that is not being shut down and make it the leader. But even then, collections that are created (or replicas added) during this process can still land on the nodes that are being removed.

I like the idea that was discussed of using roles, like AVOID_RESPONSIBILITY or just LEADER (given that OVERSEER already exists), as Anshum proposed. Maybe there could also be a DATA role to allow/avoid new replicas landing on a host. These roles would act at the host level instead of per collection (like SOLR-6220 and SOLR-6491). This would still need the "rebalanceleaders" or "forceelection" action, or something similar, to be run after the roles are set.
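For reference, the node-role machinery that exists today only understands the overseer role; here is a minimal sketch (the node name is an assumption) of ADDROLE/REMOVEROLE, which the proposed LEADER/DATA/AVOID_RESPONSIBILITY roles would presumably extend:

{code:python}
# Today's ADDROLE/REMOVEROLE only accept role=overseer; the LEADER, DATA and
# AVOID_RESPONSIBILITY roles discussed in this issue are proposals, not valid
# values. The node name is an illustrative assumption.
import requests

COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"

# Designate a node as a preferred overseer...
requests.get(COLLECTIONS_API, params={
    "action": "ADDROLE", "role": "overseer", "node": "host1:8983_solr"})

# ...and remove that designation again.
requests.get(COLLECTIONS_API, params={
    "action": "REMOVEROLE", "role": "overseer", "node": "host1:8983_solr"})
{code}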

          I must confess that I'm not yet 100% sure how to implement those roles but wanted to bring this up and hear feedback before spending hours on it and maybe realizing it's not even possible. Also, backward compatibility may be a challenge.

PS: I reopened the Jira instead of creating a new one because it doesn't have any commit or fix version and I wanted to keep the comment history; let me know if that's wrong and I should create a new one instead.

Erick Erickson added a comment -

          Fixed with the checkins under SOLR-6491.

          I'm calling this closed after the rebalancing done in SOLR-6517. There's still room for causing a specific node to become leader (or stop being leader), but let's raise that as a new JIRA if necessary.

There's also probably some new functionality in the offing (no JIRA yet) to take a node offline for everything except indexing. The idea here is to do the minimal work needed to keep from having to re-synchronize: the node would avoid being Overseer or a shard leader, would not process queries, and would only accept incoming updates. Details TBD.

          Anyway, between the two I think the functionality of this JIRA can be accomplished.

Anshum Gupta added a comment -

SOLR-5495 might intersect with the design/approach for this one. Just wanted to bring it up here so that whoever is working on this issue (if anyone) knows about potentially related changes.

Ramkumar Aiyengar added a comment -

Re: (1): Ideally you want the avoid_responsibility rule to do just that, i.e. avoid it. In the off chance that the external client ends up applying it to all replicas of the shard, a replica should still be considered for leadership. This just makes it safer for the client. If not, it should at least be possible to do this only when it is safe; i.e. without a 'force' option, the node should reject a request to avoid responsibility if it's the only responsible node around.

Hoss Man added a comment -

Off the cuff: it sounds like what you'd really want for these types of use cases is:

          1) an "AVOID_RESPONSIBILITY" role which tells a node it should never participate in elections – either for shard leader, or for overseer.
          2) per-node status info (from /admin/system) about whether this node is the overseer (SOLR-5823) and/or hosts the leader of any shard
          3) a "forceelection" Collection API action (that takes an optional collection name and shard name - so it can force overseer election, or leader election of all shards, or leader election of a specific shard)
          4) logic in CoreContainer.shutdown() that causes the node to do the following before finishing a clean shutdown:

• act as if it has the AVOID_RESPONSIBILITY role (without updating its actual ZK state) until completion of shutdown
• loop over its current responsibilities and self-trigger the necessary "forceelection" commands to elect someone else to take its place as overseer/shard leader(s)

          So...

• If you just want to reboot one node: you reboot that node, and instead of just acting like it's dropped off the face of the earth and potentially triggering elections when the ZK ephemeral nodes vanish, it proactively encourages an election first.
• If you want to shut down N machines permanently: you assign all of those N machines the "AVOID_RESPONSIBILITY" role in advance, and then iterate over them shutting them down. Ones that had no responsibilities to begin with will shut down fast; nodes that did have responsibilities will shut down more slowly as they force elections, but none of the other machines you are about to shut down will take on those responsibilities.
• If you want to reboot N machines with minimal downtime: you can iterate over your N machines checking their /admin/system response to see if they are the overseer or a shard leader; if they are, you trigger the necessary action=forceelection commands and wait for them to complete. When you are done, you should be able to shutdown/restart all N nodes very quickly, and then remove the "AVOID_RESPONSIBILITY" role at your leisure (a rough sketch of this rolling-restart loop follows below).
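As a rough illustration of the third bullet: everything here is hypothetical, since the AVOID_RESPONSIBILITY role, the "forceelection" action, and the overseer/leader fields in the /admin/system response are proposals from this discussion rather than existing Solr APIs, and the host names are made up.

{code:python}
# Hypothetical rolling-restart loop for the third bullet above. NOTE: the
# "forceelection" Collections API action and the overseer/leader fields in the
# /admin/system response are proposals, not existing Solr APIs; host names are
# made up. This is a sketch of the intended flow only.
import requests

HOSTS = ["host1:8983", "host2:8983", "host3:8983"]

for host in HOSTS:
    # Proposed per-node status (item 2 above): is this node the overseer,
    # and does it host any shard leaders?
    status = requests.get(f"http://{host}/solr/admin/system",
                          params={"wt": "json"}).json()

    if status.get("overseer") or status.get("leaderOfShards"):
        # Proposed action (item 3 above): force elections away from this node.
        requests.get(f"http://{host}/solr/admin/collections", params={
            "action": "forceelection"})  # hypothetical action
        # In a real script you would poll here until the responsibilities
        # have actually moved before restarting the node.

# At this point all N nodes can be restarted quickly, and the hypothetical
# AVOID_RESPONSIBILITY role removed afterwards.
{code}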
Ryan Ernst added a comment -

If you think that can work, sure. I was just explaining more of what I think the use case is.

Anshum Gupta added a comment -

Ryan Ernst: That makes sense, but then can we reduce it to a node 'role' in that case? Something that extends SOLR-5476 (for leaders or something, though)?

Ryan Ernst added a comment -

The problem there is: imagine you are in a cloud service and you are migrating to new hosts. You don't want SolrCloud to select a different host that you plan on taking down. 'quit' could work for this, but if you have 50 hosts and you bring up another 50, I would imagine quitting each of the old 50 hosts in succession would cause some churn in decision making. You really want a way to say "these new nodes are preferred".

Anshum Gupta added a comment -

Among other reasons, I'd be interested in a 'quit' option because if the newly elected leader dies, the assumption goes for a toss. Business logic shouldn't be bothered about 'who' the leader is.

Having business logic step in to force a re-election after the forced leader dies and a new one gets elected doesn't quite make sense to me.

Rich Mayfield added a comment -

I'm not sure which approach would be easier to implement at this point. Something else to consider and discuss is what a "success" return from the API call means - maybe that would help in choosing an approach.

I would hope that getting a "success" return means that leadership (whether given up or taken) completed through the election process. If we tell a core to give up leadership, then we only return once leadership has been successfully picked up by another core. Likewise, if we tell a core to take leadership, then we get a return once that core has leadership.

          Ideally, I don't want to make the call and then have to follow up with subsequent calls to see if the leader election process played out.
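For context, a minimal sketch (collection/shard names, URL, and timeout are assumptions) of the kind of follow-up polling this comment would like the API to make unnecessary, using the existing CLUSTERSTATUS action:

{code:python}
# Sketch of the follow-up polling the comment above wants to avoid: use the
# existing CLUSTERSTATUS action to watch for the shard leader to change.
# Collection/shard names, the URL, and the timeout are illustrative assumptions.
import time
import requests

COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"

def wait_for_new_leader(collection, shard, old_leader_core, timeout_s=60):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(COLLECTIONS_API, params={
            "action": "CLUSTERSTATUS", "collection": collection, "wt": "json",
        }).json()
        replicas = status["cluster"]["collections"][collection]["shards"][shard]["replicas"]
        for name, info in replicas.items():
            if info.get("leader") == "true" and info.get("core") != old_leader_core:
                return name  # a different replica has become leader
        time.sleep(1)
    raise TimeoutError("leader did not move within the timeout")
{code}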

Anshum Gupta added a comment - edited

Would having a 'quit' mechanism make more sense instead (something along the lines of the first approach)? That would leave the decision making and leader election to SolrCloud instead of the user.

Jack Conradson added a comment -

          +1

I can see both approaches being useful; however, the second approach is the one that I would like to see.


            People

            • Assignee: Unassigned
            • Reporter: Rich Mayfield
            • Votes: 4
            • Watchers: 12
