[SOLR-3939] An empty or just replicated index cannot become the leader of a shard after a leader goes down.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 4.0-BETA, 4.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: SolrCloud

      Description

      When a leader core is unloaded using the core admin api, the followers in the shard go into recovery but do not come out. Leader election doesn't take place and the shard goes down.

      This affects the ability to move a micro-shard from one Solr instance to another Solr instance.

      The problem does not occur 100% of the time, but it does occur a large percentage of the time.

      To set up a test, start up SolrCloud with a single shard. Add cores to that shard as replicas using the core admin API, then unload the leader core using the core admin API.
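      A minimal sketch of that repro against the Core Admin HTTP API (illustrative only; the host, port, and replica core name are assumptions based on the default example setup):

      {code:java}
      import java.io.IOException;
      import java.net.HttpURLConnection;
      import java.net.URL;

      // Illustrative repro sketch: add a replica core to shard1, then unload the
      // shard1 leader. Base URL and core names are assumptions for a default
      // single-node example setup.
      public class UnloadLeaderRepro {
        static int get(String url) throws IOException {
          HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
          con.setRequestMethod("GET");
          int code = con.getResponseCode();
          con.disconnect();
          return code;
        }

        public static void main(String[] args) throws IOException {
          String base = "http://localhost:8983/solr/admin/cores";
          // Add a replica core to shard1 of collection1 (core name is hypothetical).
          get(base + "?action=CREATE&name=replica1&collection=collection1&shard=shard1");
          // Unload the current shard1 leader; a replica should then be elected leader.
          get(base + "?action=UNLOAD&core=collection1");
        }
      }
      {code}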

      Attachments

      1. cloud.log
        14 kB
        Joel Bernstein
      2. cloud2.log
        25 kB
        Joel Bernstein
      3. SOLR-3939.patch
        11 kB
        Mark Miller
      4. SOLR-3939.patch
        11 kB
        Mark Miller


          Activity

          Mark Miller added a comment -

          I think I see two issues so far:

          1. SOLR-3940 - there can be a long wait that should not exist
          2. We should consider a sync attempt from leader to replica that fails due to a 404 a success. A 404 means either a core that has been unloaded or a Solr instance that is starting or stopping - treating it as a failure in the unloaded-core case can cause our current leader-choice strategy to fail to make progress. A stopping or starting Solr instance will move on to recovery.
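          A rough sketch of the rule in point 2 (hedged; the names and structure here are hypothetical, not the actual patch):

          {code:java}
          // Sketch only: while the leader candidate syncs to each replica, a 404
          // response means the target core was unloaded, so it should count as a
          // success rather than stalling leader election.
          class LeaderSyncSketch {
            static final int HTTP_NOT_FOUND = 404;

            interface Replica {
              int requestSync(); // hypothetical call returning an HTTP status code
            }

            boolean syncToReplicas(Iterable<Replica> replicas) {
              for (Replica r : replicas) {
                int status = r.requestSync();
                // 200 = synced; 404 = core unloaded (or node bouncing) - not a real failure.
                if (status != 200 && status != HTTP_NOT_FOUND) {
                  return false; // a genuine failure still blocks this candidate
                }
              }
              return true;
            }
          }
          {code}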

          Mark Miller added a comment -

          Patch attached.

          Treat 404 on leader sync to replicas as success + test.

          Also includes fix for SOLR-3940.

          Joel Bernstein added a comment -

          The patch solved the issue. I was able to unload the leader core and one of the replicas became the leader without delay. Tested three times, worked each time. No error in the logs.

          Great job! Thanks!

          Mark Miller added a comment -

          Thanks Joel!

          Joel Bernstein added a comment -

          I think we need to re-open this issue.

          I tried unloading the shard leader when a replica is in another Solr instance and the leader election didn't take place. The initial test had the shard leader and replica in the same Solr instance, which works with this patch.

          Here is how to setup the test:

          1) Start initial solr instance, automatically creating collection1 and shard1.
          java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -jar start.jar

          2) Add shard2 to the same solr instance using coreadmin.

          http://localhost:8983/solr/admin/cores?action=CREATE&name=mycore&collection=collection1&shard=shard2

          3) Feed exampledocs to collection1.

          4) Start up another Solr instance and point it to the same ZooKeeper. This will create a replica for shard1 and replicate the data from shard1.

          5) Unload the shard1 leader (core collection1) on the first solr instance.

          http://localhost:8983/solr/admin/cores?action=UNLOAD&core=collection1

          The leader election process doesn't take place.

          This would be the basic scenario for creating a micro-shard and then migrating it to another Solr instance.
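          To confirm whether a new leader was actually registered after step 5, one option is to check the leader znode directly in ZooKeeper. This is a sketch only: the connect string assumes the embedded -DzkRun server (Solr port + 1000), and the znode path is the one that shows up in the stack traces later in this issue.

          {code:java}
          import org.apache.zookeeper.ZooKeeper;

          // Illustrative check, not part of the issue: after unloading the shard1
          // leader, look for the shard1 leader registration znode.
          public class CheckShardLeader {
            public static void main(String[] args) throws Exception {
              ZooKeeper zk = new ZooKeeper("localhost:9983", 10000, event -> {});
              try {
                String leaderPath = "/collections/collection1/leaders/shard1";
                if (zk.exists(leaderPath, false) == null) {
                  System.out.println("No leader registered for shard1 - election did not complete");
                } else {
                  System.out.println("Leader props: " + new String(zk.getData(leaderPath, false, null)));
                }
              } finally {
                zk.close();
              }
            }
          }
          {code}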

          Mark Miller added a comment -

          Thanks for the report - we don't have much testing around unload with SolrCloud yet - this is good stuff to address.

          I've set up a test that seems to show an issue. I'll try and look at it some more tomorrow.

          Mark Miller added a comment -

          Is this with an empty index?

          Joel Bernstein added a comment -

          I tested with the exampledocs loaded. Step 3 in the test above. I loaded the shards before starting up the replica on the second solr instance.

          Mark Miller added a comment -

          Interesting - I can see an issue when I run the test with empty indexes, but my current test is passing if I add some docs. The main reason I see for this at the moment is that a leader who tries to sync with his replicas will always fail with an empty tlog (no frame of reference).

          I'll have to dig deeper for the 'docs in index' case.

          Joel Bernstein added a comment -

          Not sure if this helps. Here is a stack trace from my second Solr instance. This is the instance that would be the leader after the leader core was unloaded on the first instance.

          SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
          at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:709)
          at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:673)
          at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1070)
          at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
          at org.apache.solr.cloud.ZkController.access$100(ZkController.java:82)
          at org.apache.solr.cloud.ZkController$1.command(ZkController.java:190)
          at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:116)
          at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:46)
          at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:90)
          at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:526)
          at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)
          Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/collection1/leaders/shard1
          at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
          at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
          at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:927)
          at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
          at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
          at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
          at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
          at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:687)
          ... 10 more

          Oct 15, 2012 3:39:18 PM org.apache.solr.common.SolrException log
          SEVERE: :org.apache.solr.common.SolrException: There was a problem finding the leader in zk
          at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1080)
          at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:273)
          at org.apache.solr.cloud.ZkController.access$100(ZkController.java:82)
          at org.apache.solr.cloud.ZkController$1.command(ZkController.java:190)
          at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:116)
          at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:46)
          at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:90)
          at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:526)
          at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)

          Joel Bernstein added a comment -

          I restarted the second solr instance and it came up as the leader for shard1, with no errors.

          I'll try to reproduce it again.

          Joel Bernstein added a comment -

          I reproduced it again. I pulled again from the top of the 4x branch. I didn't apply the patch because it was committed, I believe.

          Same exact steps as described above. Attached is part of the log file from the second Solr instance that shows the replica going into recovery. It's looking for the collection1 core that was unloaded from the first solr instance.

          Joel Bernstein added a comment -

          The log output from solr.

          Joel Bernstein added a comment -

          It looks like after the leader is unloaded, the replica attempts to sync to the unloaded leader as part of the process to determine if it can be leader. When this fails, it thinks that there are better candidates to become leader. Then it goes into a recovery loop.

          Mark Miller added a comment -

          That's what I see when I have an empty index. The leader sync fails because sync always fails with no local versions.

          The case with docs is perhaps a bit trickier since my simple test passes. I'll take a look at the logs.

          Mark Miller added a comment -

          I think I see the issue. While we have talked about it, we don't currently try to populate the transaction log after a replication.

          So, the second core replica is replicating, it's got docs but no versions, then it tries to become the leader - but just like with the empty index, it cannot successfully sync with no versions as a frame of reference.

          Mark Miller added a comment -

          (My test was passing because I had the replica up initially, so it got the docs from the leader, not through replication.)

          Joel Bernstein added a comment -

          So that was the difference from the initial test, which had all the cores in a single Solr instance. I ran that test by starting up all the cores and then loading, which hit all the transaction logs.

          Mark Miller added a comment -

          I'm not sure - I'd have to think about it. Could be as simple as timing. If the new leader candidate gets an updated cluster state fast enough, he won't even consider the core that just went down when trying to become the leader.

          Anyhow, for what I am seeing in my tests so far, I think I have a solution that works for now - just have to test it out.

          Yonik Seeley added a comment -

          Mark & I were chatting a little about this. The easiest fix would seem to be

          • if a single remaining replica is active, it should always become the leader (even if it has no recent versions)
          • if an existing replica comes back up and tries to sync with this new leader, it should fail (or somehow be forced to replicate from the new leader)

          That still begs the question... what if two replicas are in the situation of having no recent versions because they both just finished replicating?

          Another solution (which is more difficult and would take longer) is to store some number of the latest versions in the commit data.

          Joel Bernstein added a comment -

          Curious as to what the procedure is when the leader VM is taken down, rather than when the leader core is unloaded. This scenario seems to work even when the replica was just replicated to.

          Joel Bernstein added a comment -

          Actually, I just tested, and it's a problem when you take down the leader VM as well. So it's not specific to unloading the leader core.

          Mark Miller added a comment -

          Joel: It could just be timing - live nodes are likely faster to update than the cluster state - when you just unload a core, the live node stays - so the unloaded core will only not take part in leader election when the leader candidate gets the new cluster state. When you take down a vm, the live node goes. Perhaps this is noticed faster. In either case, it's a race.

          Yonik:

          for case one, I'd like to still sync if possible - but if there are no starting versions in the updatelog, don't worry if the sync was not a success?

          I still have to add a test and fix for case 2.

          Mark Miller added a comment -

          Sorry - not starting versions - recent versions.

          Mark Miller added a comment -

          Here is a patch that should address point 1 - point 2 still has no fix or test.

          Joel Bernstein added a comment -

          I applied the patch to a fresh pull of the 4x branch.

          Then I performed the test that I detailed in the Oct 14th comment. The versions issue still causes the recovery loop. I've attached the log (cloud2.log) that shows the recovery loop.

          When I do a similar test all in the same instance, it works, but as you mentioned this is likely because of the race.

          Mark Miller added a comment -

          I've now got a test for the second case Yonik mentioned.

          Yonik Seeley added a comment (edited) -

          "for case one, I'd like to still sync if possible"

          Case #1 only had a single active replica up, so there's no one to sync with (or you mean try to sync with even shards marked as down?)

          Mark Miller added a comment -

          I mean, in the general case, don't just let someone be the leader if they are active. Make them sync and only let them be the leader if they are successful.

          However, if they have no recent versions, let them be the leader even if the sync fails. This lets us keep consistency in almost all cases. True, without the sync, there will still not be data loss, but I'd still like to try and force consistency if possible as well.
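          A hedged sketch of that rule (illustrative types and names, not Solr's actual election code):

          {code:java}
          import java.util.List;

          // Sketch of the policy described above: a candidate must sync with the
          // other replicas to become leader, except when its update log has no
          // recent versions (e.g. right after replication), in which case a failed
          // sync should not block it.
          class LeaderCandidatePolicy {
            boolean mayBecomeLeader(List<Long> recentVersions, boolean syncSucceeded) {
              if (syncSucceeded) {
                return true; // normal case: the candidate proved it is up to date
              }
              // With no recent versions there is no frame of reference, so a sync
              // can never succeed; requiring it would stall the election.
              return recentVersions.isEmpty();
            }
          }
          {code}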

          Yonik Seeley added a comment -

          Actually, even if we someday replicate recent versions along with the index (by adding them to the commitData in the index, etc), it may still be good to support indexes w/o version info. On the other side of the spectrum from having indexing completely automated, some people may want the ability to create a new shard off-line and then insert it into the cluster as read-only.

          Mark Miller added a comment -

          In two other issues I was working on, unrelated changes seemed to start causing test fails in one of the solrcloud tests - it's a fail I had seen sometimes in the past on Apache jenkins. A fail about waiting to notice a live node drop. It seems that was caused by this - it took some time to trace it back here. One of the nodes doesn't see a live node change because he is stuck in a leader election loop.

          Given that, I plan on committing what I have so far - so it stops blocking my other two issues. We can then iterate further on trunk.

          Mark Miller added a comment -

          Okay, I've finally got all tests passing reliably for me. I had to add SocketException to the list of exceptions that are okay to consider a peer sync success. I'll try and get this committed tonight.
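          Folding that into the earlier 404 rule, a rough classification sketch (illustrative only, not the committed code):

          {code:java}
          import java.net.SocketException;

          // Sketch: failures that should not count against the leader candidate
          // when syncing to a replica - an unloaded core (404) or a node that is
          // going up or down (SocketException).
          class PeerSyncFailurePolicy {
            boolean okToTreatAsSuccess(Throwable cause, int httpStatus) {
              return httpStatus == 404 || cause instanceof SocketException;
            }
          }
          {code}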

          Yonik Seeley added a comment -

          Trying to think if this could happen when there are versions too... say that instead of having no versions, we just have old versions from before we did the replication. This may argue for somehow marking the start of a replication in the transaction log and then never retrieving versions older than that.

          Yonik Seeley added a comment -

          Thinking of some scenarios where this could happen:

          1. R1,R2 both up and active, add docs 1,2,3
          2. bring R2 down
          3. add docs 4 through 1million
          4. bring R2 up, peersync fails, replication is kicked off
          5. R2 finishes replication and becomes active, but its recent versions still list 1,2,3
          6. bring R1 down, R2 becomes the leader
          7. bring R1 up, it does a peer-sync with R2, which looks like it has really old versions (and succeeds because of that)
          8. if the leader (R2) does a peer-sync back with R1, it will fail (not sure of the consequences of this)

          Another variation... if there's an update between 6 and 7:
          6.5. add doc 1million+1

          This will cause recent versions of R2 to be 1,2,3,1000001
          It would be good to verify that peersync to the leader will either fail (causing full replication), or pick up the new document.
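          A toy illustration of the scenario above (hedged; the overlap test is only one possible way to detect the problem, not existing Solr behavior):

          {code:java}
          import java.util.HashSet;
          import java.util.List;
          import java.util.Set;

          // R2's update log still lists versions 1,2,3 even though its replicated
          // index holds the first million docs, so any decision based on recent
          // versions alone is misleading. Checking for overlap between the two
          // recent-version lists is one possible guard.
          class RecentVersionsCheck {
            static boolean overlaps(List<Long> mine, List<Long> theirs) {
              Set<Long> seen = new HashSet<>(mine);
              for (Long v : theirs) {
                if (seen.contains(v)) return true;
              }
              return false;
            }

            public static void main(String[] args) {
              List<Long> r2Recent = List.of(1L, 2L, 3L);                     // stale: from before replication
              List<Long> r1Recent = List.of(999_998L, 999_999L, 1_000_000L); // genuinely recent
              // No overlap: peer sync should be refused and a full replication forced.
              System.out.println("can peer sync: " + overlaps(r2Recent, r1Recent));
            }
          }
          {code}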

          Mark Miller added a comment -

          Currently the leader does not peer sync back to a replica coming up because it would have to buffer updates.

          I think that if a replica is somehow ahead of the leader when coming back, peersync should fail and it should replicate. I think since this is not a common case, that is much simpler than trying to peersync back from the leader to the replica in this case.

          Yonik Seeley added a comment -

          "Currently the leader does not peer sync back to a replica coming up because it would have to buffer updates."

          peer sync doesn't require buffering updates. AFAIK, we don't do that until we realize we need to replicate?

          Mark Miller added a comment -

          As far as I remember, if updates are coming in when you try and peer sync, we fail it? Isn't that what capturing the starting versions is all about?

          When a leader syncs with his replicas on leader election, we know docs are not coming in, so we don't worry about that starting versions check - but if you want to peer sync from the leader to a replica that is coming back up, if updates are coming in, you are going to force a replication anyway. Since it's already an uncommon case, it doesn't seem worth tackling. I mention buffering, because it seemed you would have to to be able to peer sync when updates are coming in (or block updates).

          Mark Miller added a comment -

          I've committed my latest work to 4x, Joel - can you do a bit more testing with a recent checkout?

          Joel Bernstein added a comment -

          I ran the Oct 14th test and the leader election worked perfectly. Then I tested shutting down the leader VM instead of unloading the leader core, and this worked fine.

          Then I tried a leader with two replicas that had both just been replicated to. When I unloaded the leader, neither replica became the leader. But this was the case that was not yet accounted for, I believe.

          I can't think of a use case where the second scenario would happen, though.

          The first scenario though is critical for migrating micro-shards, so it's great that you committed this.

          Thanks for your work on this issue.

          Joel

          Yonik Seeley added a comment (edited) -

          "Isn't that what capturing the starting versions is all about?"

          For a node starting up, yeah. For a leader syncing to someone else - I don't think it should matter.
          edit: OK - I think I got what you're saying now - if the new node coming up did have an extra doc, then the only way to guarantee the leader pick it up would be if not too many updates came in for either. We could require that a sync from the leader to the replica have the list of recent versions overlap enough (else the replica would be forced to replicate), but as you say... if updates are coming in fast enough (and that is prob pretty slow) you're going to force a replication anyway.

          "but if you want to peer sync from the leader to a replica that is coming back up, if updates are coming in, you are going to force a replication anyway."

          If updates were coming in fast enough during the "bounce"... I guess so.

          Mark Miller added a comment -

          Okay, I'm going to resolve this - we can make a new issue for the case where a replica comes up and is ahead somehow.

          Commit Tag Bot added a comment -

          [branch_4x commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1397672

          SOLR-3939: Consider a sync attempt from leader to replica that fails due to 404 a success.
          SOLR-3940: Rejoining the leader election incorrectly triggers the code path for a fresh cluster start rather than fail over.

          Commit Tag Bot added a comment -

          [branch_4x commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1402362

          SOLR-3933: Distributed commits are not guaranteed to be ordered within a request.

          SOLR-3939: An empty or just replicated index cannot become the leader of a shard after a leader goes down.

          SOLR-3971: A collection that is created with numShards=1 turns into a numShards=2 collection after starting up a second core and not specifying numShards.

          SOLR-3932: SolrCmdDistributorTest either takes 3 seconds or 3 minutes.

          Commit Tag Bot added a comment -

          [branch_4x commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1402361

          SOLR-3933: Distributed commits are not guaranteed to be ordered within a request.

          SOLR-3939: An empty or just replicated index cannot become the leader of a shard after a leader goes down.

          SOLR-3971: A collection that is created with numShards=1 turns into a numShards=2 collection after starting up a second core and not specifying numShards.

          SOLR-3932: SolrCmdDistributorTest either takes 3 seconds or 3 minutes.


            People

            • Assignee: Mark Miller
            • Reporter: Joel Bernstein
            • Votes: 0
            • Watchers: 7
