SOLR-5796

With many collections, leader re-election takes too long when a node dies or is rebooted, leading to some shards getting into a "conflicting" state about who is the leader.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7.1, 4.8, 6.0
    • Component/s: SolrCloud
    • Labels: None
    • Environment:

      Found on branch_4x

    Description

      I'm doing some testing with a 4-node SolrCloud cluster against the latest rev in branch_4x with many collections: 150 to be exact, each with 4 shards and rf=3, so 450 cores per node. The nodes are decent in terms of resources: -Xmx6g with 4 CPUs (m3.xlarge instances in EC2).

      The problem occurs when rebooting one of the nodes, say as part of a rolling restart of the cluster. If I kill one node and then wait for an extended period of time, such as 3 minutes, then all of the leaders on the downed node (roughly 150) have time to failover to another node in the cluster. When I restart the downed node, since leaders have all failed over successfully, the new node starts up and all cores assume the replica role in their respective shards. This is goodness and expected.

      However, if I don't wait long enough for the leader failover process to complete on the other nodes before restarting the downed node,
      then some bad things happen. Specifically, when the dust settles, many of the previous leaders on the node I restarted get stuck in the "conflicting" state seen in the ZkController, starting around line 852 in branch_4x:

      852 while (!leaderUrl.equals(clusterStateLeaderUrl)) {
      853   if (tries == 60) {
      854     throw new SolrException(ErrorCode.SERVER_ERROR,
      855         "There is conflicting information about the leader of shard ..."
                  // lines 856-858 (rest of the message and closing brace) elided
      859   Thread.sleep(1000);
      860   tries++;
      861   clusterStateLeaderUrl = zkStateReader.getLeaderUrl(collection, shardId,
      862       timeoutms);
      863   leaderUrl = getLeaderProps(collection, cloudDesc.getShardId(), timeoutms)
      864       .getCoreUrl();
      865 }

      As you can see, the code tries to give this problem some time to work itself out: 1 minute, to be exact (60 tries with a 1-second sleep). Unfortunately, that doesn't seem to be long enough for a busy cluster with many collections. One might argue that 450 cores per node is asking too much of Solr, but I think this points to a bigger issue: a node coming back up isn't aware that it went down, and leader election is still running (slowly) on the other nodes. Moreover, once this problem occurs, it's not clear how to fix it other than shutting the node down again and waiting for leader failover to complete.

      It's also interesting that /clusterstate.json was updated by the healthy node taking over the leader role, but /collections/<coll>/leaders/shard# was not. I added some debugging and it looks like the Overseer queue is extremely backed up with work.

      Maybe the solution here is just to wait longer, but I'd also like feedback from the community on other options. I know there are plans to help scale the Overseer (SOLR-5476), so maybe that helps; I'm adding more debugging to confirm whether this is really due to Overseer backlog (which I suspect it is).

      In general, I'm a little confused by leader state being kept in multiple places in ZK. Is there any background information on why we have leader state in /clusterstate.json and in the leader path znode?

      Also, here are some interesting side observations:

      a. If I use rf=2, this problem doesn't occur: leader failover happens more quickly and there's less Overseer work. That may be a red herring, but I can consistently reproduce the issue with rf=3 and not with rf=2. I suspect that's because rf=2 means only 300 cores per node instead of 450, which is just enough less work for the issue to resolve itself.

      b. To support that many cores, I had to set -Xss256k to reduce the thread stack size, since Solr uses a lot of threads during startup (the high point was around 800). This might be something we should recommend on the mailing list / wiki somewhere; see the startup sketch below.
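      For reference, a minimal sketch of the startup invocation used in this test (assumptions: the stock Solr 4.x example layout started via start.jar, and a placeholder zkHost value; only -Xmx6g and -Xss256k are taken from the description above):

          java -Xmx6g -Xss256k \
               -DzkHost=zk1:2181,zk2:2181,zk3:2181/solr \
               -jar start.jar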

      Attachments

      1. SOLR-5796.patch (8 kB) - Timothy Potter

        Activity

        Timothy Potter added a comment -

        Little more to this for today ...

        I tried this on more powerful nodes (m3.2xlarge) and changed the wait to tries == 180 (instead of 60) and voilà, the restarted node came back as expected. This begs the question whether we should make that wait period configurable for those installations that have many collections in a cluster? To be clear, I'm referring to the wait period in ZkController, while loop starting around line 852 (see above). I'd prefer to have something more deterministic versus an upper limit on waiting, as that seems like a ticking time bomb in a busy cluster. I'm going to try a few more ideas out over the weekend.

        Mark Miller added a comment -

        Cool - recently saw a user post a problem with this conflicting state - glad to see you already have a jump on it

        Mark Miller added a comment -
        > This begs the question whether we should make that wait period configurable for those installations that have many collections in a cluster? To be clear, I'm referring to the wait period in ZkController, while loop starting around line 852 (see above).

        I have no qualms with making any timeouts configurable, but it seems the defaults should be fairly high as well - we want to work out of the box with most reasonable setups if possible.

        +1 to making it configurable, but let's also crank it up.

        Timothy Potter added a comment -

        Ok, sounds good. Also, let's assume that the timeout resolves this for most cores, but a couple stragglers on the end still end up in this state. I assume the solution there would be to send a core reload to those specific cores, which should re-run all the startup recovery logic. If so, I think a simple utility app that uses CloudSolrServer to find all the angry cores and reload them would be useful, which I'll also work on.
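        A minimal sketch of what such a utility might look like (assumptions: SolrJ 4.x API; "angry" cores are identified here simply as replicas whose state is not "active"; the class name and zkHost value are made up for illustration):

            import org.apache.solr.client.solrj.impl.CloudSolrServer;
            import org.apache.solr.client.solrj.impl.HttpSolrServer;
            import org.apache.solr.client.solrj.request.CoreAdminRequest;
            import org.apache.solr.common.cloud.ClusterState;
            import org.apache.solr.common.cloud.Replica;
            import org.apache.solr.common.cloud.Slice;
            import org.apache.solr.common.cloud.ZkStateReader;

            public class ReloadConflictingCores {
              public static void main(String[] args) throws Exception {
                CloudSolrServer cloud = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181/solr");
                cloud.connect();
                ClusterState state = cloud.getZkStateReader().getClusterState();
                for (String collection : state.getCollections()) {
                  for (Slice slice : state.getSlices(collection)) {
                    for (Replica replica : slice.getReplicas()) {
                      String replicaState = replica.getStr(ZkStateReader.STATE_PROP);
                      if (!ZkStateReader.ACTIVE.equals(replicaState)) {
                        // Ask the node hosting the stuck replica to RELOAD that core,
                        // which re-runs the startup recovery logic.
                        HttpSolrServer node = new HttpSolrServer(replica.getStr(ZkStateReader.BASE_URL_PROP));
                        CoreAdminRequest.reloadCore(replica.getStr(ZkStateReader.CORE_NAME_PROP), node);
                        node.shutdown();
                      }
                    }
                  }
                }
                cloud.shutdown();
              }
            }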

        Timothy Potter added a comment -

        Quick and dirty patch (for trunk) that introduces this as a ConfigSolrXml property (sibling to leaderVoteWait). However, getting this property passed along to where it's eventually needed in ZkController is a bit ugly. Specifically,

        a. default value (DEFAULT_LEADER_CONFLICT_RESOLVE_WAIT) in ConfigSolr is private (did this to match the default for leaderVoteWait). This makes it hard to not break signatures on public methods, such as ZkContainer#initZooKeeper, because this class doesn't have access to the default value.

        b. ZkControllerTest doesn't have access to the default value either for constructing a ZkController, so I've done something ugly and hard-coded the value 60000 directly in the test.

        It seems like other config properties may be needed in the future, so I'm wondering if it's better to just make the ConfigSolr object available from the CoreContainer, which means adding:

        public ConfigSolr getConfig() {
          return cfg;
        }

        to the CoreContainer object. Is this a bad idea? If not, then ZkController can just pull this setting straight from the ConfigSolr object via the ref to CoreContainer.

        Or, maybe adding props like this is a rare-enough activity that I'm over-thinking this and introducing a new property into the public method / ctor signatures is OK?
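        For illustration, roughly what the solr.xml side of this could look like (assumptions: the property name leaderConflictResolveWait is inferred from the DEFAULT_LEADER_CONFLICT_RESOLVE_WAIT constant above, and the values, in milliseconds, are chosen arbitrarily):

            <solr>
              <solrcloud>
                <str name="host">${host:}</str>
                <int name="hostPort">${jetty.port:8983}</int>
                <int name="zkClientTimeout">${zkClientTimeout:15000}</int>
                <int name="leaderVoteWait">180000</int>
                <!-- assumed name for the new setting introduced by this patch -->
                <int name="leaderConflictResolveWait">180000</int>
              </solrcloud>
            </solr>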

        Timothy Potter added a comment -

        Hi Mark,

        So another thing I'm curious about is why leader failover takes so long here. Even without increasing the wait period, 60 seconds seems like a long time for a new leader to be elected, especially when it's failing over to a healthy node. I understand technically why it takes so long (propagating the /clusterstate.json change around the cluster), so I guess my question is what we can do to speed that up. Is SOLR-5476 (scaling the Overseer) the solution here? Just wanted to get your thoughts on that as well.

        Thanks. Tim

        Mark Miller added a comment -

        Honestly, I don't know why it's taking so long. We should get to the bottom of it.

        Propagating the clusterstate.json on a fast network should not take long at all.

        ZooKeeper should be able to handle a very high rate of updates and reads - I'm not even sure why the Overseer is so slow at publishing state.

        We can find out though

        Mark Miller added a comment -

        When a leader tries to take over, it first tries to sync with the rest of the shard - we should look at how long this is taking. There are a variety of places we might be able to start logging time intervals to help get an idea how long various things take.

        Ramkumar Aiyengar added a comment -

        Doesn't the Overseer throttle updates? It used to be a needless 1.5s before, but even now I think it sleeps 100ms between batches of updates (I'd have to confirm with the code). Maybe that's enough to slow things down in this case?

        Mark Miller added a comment -

        AFAIK, Noble and I changed that so that if the queue has items waiting, the Overseer will plow through them.

        Mark Miller added a comment -

        Let's spin off the performance issue into a new JIRA issue - I'll wrap this one up.

        ASF subversion and git services added a comment -

        Commit 1574638 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1574638 ]

        SOLR-5796: Increase how long we are willing to wait for a core to see the ZK advertised leader in it's local state.
        SOLR-5796: Make how long we are willing to wait for a core to see the ZK advertised leader in it's local state configurable.

        ASF subversion and git services added a comment -

        Commit 1574641 from Mark Miller in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1574641 ]

        SOLR-5796: Increase how long we are willing to wait for a core to see the ZK advertised leader in it's local state.
        SOLR-5796: Make how long we are willing to wait for a core to see the ZK advertised leader in it's local state configurable.

        ASF subversion and git services added a comment -

        Commit 1574664 from Mark Miller in branch 'dev/trunk'
        [ https://svn.apache.org/r1574664 ]

        SOLR-5796: Fix illegal API call to format.

        ASF subversion and git services added a comment -

        Commit 1574682 from Mark Miller in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1574682 ]

        SOLR-5796: Fix illegal API call to format.

        Timothy Potter added a comment -

        I'm still a little concerned about a couple of things:

        1) why is the leader "state" stored in two places in ZooKeeper (/clusterstate.json and /collections/<coll>/leaders/shard#)? I'm sure there is a good reason for this but can't see why

        2) if the timeout still occurs (and we don't want to wait forever), can't the node with the conflict just favor what's in the leader path, assuming that replica is active and agrees? In other words, instead of throwing an exception and ending up in a "down" state, why can't the replica seeing the conflict just go with what ZooKeeper says?
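        For anyone following along, the two leader records being compared can be inspected directly with the ZooKeeper CLI (a sketch; assumptions: placeholder ZooKeeper address, collection "collection1" and shard "shard1"):

            # leader as recorded in the shared cluster state
            ./zkCli.sh -server zk1:2181 get /clusterstate.json
            # leader as recorded in the per-shard leader znode
            ./zkCli.sh -server zk1:2181 get /collections/collection1/leaders/shard1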

        I'm digging into leader failover timing / error handling today.

        Thanks. Tim

        Mark Miller added a comment -
        • 1: At one time, the leader was not actually in the cluster state, I think... anyway, we do make use of it in one case in recovery, where we make sure we have an updated cloudstate that includes the latest leader for the shard. The ZK leader node is always up to date; clusterstate info can be stale. That is the main difference.
        • 2: Perhaps - if the clusterstate won't update though, it seems like something is probably fairly wrong and we may want to favor not claiming we are active.
        Mark Miller added a comment -

        Let's roll the perf investigation and other concerns out into new issues.

        Timothy Potter added a comment -

        Just wanted to add a final comment here: it helps immensely to restart the node hosting the overseer last as part of a rolling restart. It's subtle, but depending on the restart order, the overseer may have to fail over to a new node after every restart in the sequence. Consider 4 nodes: node1, node2, node3, node4. If the overseer is on node1 and I do my rolling restart in the order 2, 3, 4, 1, then the overseer only needs to migrate once. This seems to really help stability during a rolling restart of the cluster (for obvious reasons). A quick way to check which node currently holds the overseer role is sketched below.
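        A sketch of that check using the ZooKeeper CLI (assumption: placeholder ZooKeeper address; the znode layout is the standard SolrCloud one):

            # the "id" field identifies the node currently acting as overseer; restart that node last
            ./zkCli.sh -server zk1:2181 get /overseer_elect/leader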

        Mark Miller added a comment -

        Do we have a JIRA issue for the instability you mention in the failover? I can guess what it is... but we should track it and harden it.

        Steve Rowe added a comment -

        Mark Miller, Timothy Potter, any reason not to backport this to 4.7.1?

        ASF subversion and git services added a comment -

        Commit 1581195 from Steve Rowe in branch 'dev/branches/lucene_solr_4_7'
        [ https://svn.apache.org/r1581195 ]

        SOLR-5796: Increase how long we are willing to wait for a core to see the ZK advertised leader in it's local state.
        SOLR-5796: Make how long we are willing to wait for a core to see the ZK advertised leader in it's local state configurable.
        SOLR-5796: Fix illegal API call to format. (merged branch_4x revisions r1574641 and r1574682)

        ASF subversion and git services added a comment -

        Commit 1581196 from Steve Rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1581196 ]

        SOLR-5796: move CHANGES.txt entries to 4.7.1 section

        ASF subversion and git services added a comment -

        Commit 1581198 from Steve Rowe in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1581198 ]

        SOLR-5796: move CHANGES.txt entries to 4.7.1 section (merged trunk r1581196)

        Steve Rowe added a comment -

        Bulk close 4.7.1 issues

        Rich Mayfield added a comment -

        I ran into this problem also (we have upwards of 1000 cores in each replica).

        I greatly appreciate the fix and the advice about rolling the overseer last. Can you point me to any other recommendations and caveats for doing a rolling restart?


          People

          • Assignee: Mark Miller
          • Reporter: Timothy Potter
          • Votes: 0
          • Watchers: 9
