SOLR-7569

Create an API to force a leader election between nodes

    Details

      Description

      There are many reasons why Solr will not elect a leader for a shard, e.g. all replicas' last published state was recovery, or bugs which cause a leader to be marked as 'down'. While the best solution is that they never get into this state, we need a manual way to fix it when they do. Right now we can do a tricky dance involving bouncing the node (since the recovery paths for bouncing and REQUESTRECOVERY are different), but that is difficult when running a large cluster. Such a manual API may lead to some data loss, but in some cases it is the only option to restore availability.

      This issue proposes to build a new collection API which can be used to force the replicas of a shard to recover and elect a leader, while avoiding data loss on a best-effort basis.
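
      For orientation, a minimal sketch of invoking such an API over HTTP is shown below. The action name FORCELEADER reflects where the discussion below ends up; the host, collection, and shard values are placeholders, and this is illustrative rather than the final API.

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import java.net.HttpURLConnection;
        import java.net.URL;

        public class ForceLeaderExample {
          public static void main(String[] args) throws Exception {
            // Placeholder host, collection and shard; adjust for the target cluster.
            String url = "http://localhost:8983/solr/admin/collections"
                + "?action=FORCELEADER&collection=mycollection&shard=shard1&wt=json";
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
              String line;
              while ((line = in.readLine()) != null) {
                System.out.println(line); // Prints the JSON response from the Collections API.
              }
            }
          }
        }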

      1. SOLR-7569_lir_down_state_test.patch
        6 kB
        Ishan Chattopadhyaya
      2. SOLR-7569.patch
        30 kB
        Ishan Chattopadhyaya
      3. SOLR-7569.patch
        29 kB
        Ishan Chattopadhyaya
      4. SOLR-7569.patch
        29 kB
        Ishan Chattopadhyaya
      5. SOLR-7569.patch
        34 kB
        Ishan Chattopadhyaya
      6. SOLR-7569.patch
        8 kB
        Ishan Chattopadhyaya
      7. SOLR-7569.patch
        36 kB
        Ishan Chattopadhyaya
      8. SOLR-7569.patch
        35 kB
        Ishan Chattopadhyaya
      9. SOLR-7569.patch
        34 kB
        Ishan Chattopadhyaya
      10. SOLR-7569.patch
        34 kB
        Ishan Chattopadhyaya
      11. SOLR-7569.patch
        32 kB
        Ishan Chattopadhyaya
      12. SOLR-7569.patch
        18 kB
        Ishan Chattopadhyaya
      13. SOLR-7569.patch
        18 kB
        Ishan Chattopadhyaya
      14. SOLR-7569.patch
        15 kB
        Ishan Chattopadhyaya
      15. SOLR-7569.patch
        15 kB
        Ishan Chattopadhyaya
      16. SOLR-7569.patch
        21 kB
        Ishan Chattopadhyaya
      17. SOLR-7569.patch
        20 kB
        Ishan Chattopadhyaya
      18. SOLR-7569-testfix.patch
        2 kB
        Ishan Chattopadhyaya

        Issue Links

          Activity

          Ishan Chattopadhyaya added a comment - edited

          Trying to tackle this situation where all replicas (including the leader) are somehow marked "down" (maybe due to bugs?) and there is no leader in the shard, and hence the entire shard is down. Adding a new collection API "RECOVERSHARD".

          In this strawman patch, I am evaluating the following approach:

          • Remove all leader initiated recovery flags for this shard. [TODO]
          • Pick the next leader: If the leader election queue is not empty and the first replica in the queue is on a live node, choose the replica as the next leader. Otherwise, pick a random replica, which is on a live node, to become the next leader (TODO: we can have the user specify which replica he/she wants as the next leader).
          • If the chosen leader is not at the head of the leader election queue, have it join the election at the head (similar to what REBALANCELEADERS tries to do). [TODO]
          • Mark the chosen next leader as "active". Mark rest of the replicas, which are on live nodes, as "recovering".
          • Issue the core admin REQUESTRECOVERY command to all the replicas marked "recovering" (sketched below).
          • Wait till recovery completes. [TODO]

          Does the above approach sound reasonable? Does the general path taken in the patch seem reasonable?
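
          As a rough illustration of the REQUESTRECOVERY step above (this is not code from the patch; the replica URLs and core names are hypothetical), issuing the core admin command to each replica marked "recovering" could look like this:

            import java.net.HttpURLConnection;
            import java.net.URL;
            import java.util.Arrays;
            import java.util.List;

            public class RequestRecoverySketch {
              public static void main(String[] args) throws Exception {
                // Hypothetical core URLs of the replicas that were just marked "recovering".
                List<String> replicaCoreUrls = Arrays.asList(
                    "http://node1:8983/solr/mycollection_shard1_replica2",
                    "http://node2:8983/solr/mycollection_shard1_replica3");
                for (String coreUrl : replicaCoreUrls) {
                  // Split the core URL into the node base URL and the core name.
                  int idx = coreUrl.lastIndexOf('/');
                  String baseUrl = coreUrl.substring(0, idx);    // e.g. http://node1:8983/solr
                  String coreName = coreUrl.substring(idx + 1);  // e.g. mycollection_shard1_replica2
                  String url = baseUrl + "/admin/cores?action=REQUESTRECOVERY&core=" + coreName + "&wt=json";
                  HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                  conn.setRequestMethod("GET");
                  System.out.println(coreUrl + " -> HTTP " + conn.getResponseCode());
                }
              }
            }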

          Mark Miller added a comment -

          maybe due to bugs?

          The current design allows for this.

          See SOLR-7034 and SOLR-7065 as possible improvement steps.

          This is kind of a hack solution to a current production problem, or a way to force a leader election even when we know it probably means data loss; those issues are closer to what is supposed to come next in terms of improving the current design.

          I may have also just seen a bug where LIR info in ZK prevents anyone from becoming the leader even on full restart. SOLR-7065 should address those kinds of bugs if done right.

          Ishan Chattopadhyaya added a comment -

          Thanks Mark Miller for the pointer to the issues.
          I agree this is a sledge hammer to undo the effects of bugs, which shouldn't be needed if we go by an improved design. We have observed the effects of these bugs in production clusters of our clients, and this is to help them in such a scenario. Do you think we should continue down this sledge hammer path, parallel to fixing the bugs?

          Mark Miller added a comment -

          Yes, I think having this option is useful in the short term and in the longer term. The system will generally refuse to continue on if it thinks it may have data loss and stopping could allow a user to possibly recover that data. This could act as a way for a user to override that.

          Erick Erickson added a comment -

          bq: If the chosen leader is not at the head of the leader election queue, have it join the election at the head (similar to what REBALANCELEADERS tries to do).

          Be really careful if you are trying to manipulate the leader election stuff, it's very easy to get wrong. Or at least it was last time I looked, perhaps it's changed a lot since then. I'd be glad to chat about what I remember if you'd like.

          Ishan Chattopadhyaya added a comment -

          Updated patch with some cleanup. Still some TODOs, nocommits to go.

          Varun Thacker added a comment -

          Pick the next leader: If the leader election queue is not empty and the first replica in the queue is on a live node, choose the replica as the next leader. Otherwise, pick a random replica, which is on a live node, to become the next leader (TODO: we can have the user specify which replica he/she wants as the next leader).

          Maybe pick the leader amongst the replicas which has the latest commit timestamp?

          Mark Miller added a comment -

          I don't really like the idea of choosing a leader. It seems to me this feature should force a new election and address the state that prevents someone from becoming leader somehow. You still want the sync stage and the system to pick the best leader though. This should just get you out of the state that is preventing a leader from being elected.

          Ishan Chattopadhyaya added a comment -

          I just tried to simulate the scenario where all the replicas are in down state due to LIR, and there is no leader. In this state, the leader election queue is empty.

          So, I am thinking of some way to have the replicas (that are on live nodes) join the leader election. Is there any clean way of doing that, short of a core reload?

          Ishan Chattopadhyaya added a comment -

          A patch containing the test that simulates the state described above.

          Ishan Chattopadhyaya added a comment -

          > In this state, the leader election queue is empty.
          Ignore that, I was catching that state before the replicas had a chance to rejoin the election. The last assert in the patch is inappropriate.

          Ishan Chattopadhyaya added a comment - edited

          This should just get you out of the state that is preventing a leader from being elected.

          Updating the patch that attempts to do just that.

          • Clear LIR znodes (sketched below)
          • Unset any existing leader flags.
          • Mark all nodes as active
          • Wait for leader election so that normal state is restored

          (I also tried to mark just one of the replicas as active instead of all the replicas, hoping it will become leader and others would recover from it. However, this resulted in one of the other down replicas becoming leader but still staying down. Looking into why that could be happening; bug?)
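
          A minimal sketch of the LIR-clearing step is shown below, based on the SolrZkClient calls quoted later in this thread. The znode path layout here is an assumption; real code should obtain it from ZkController.getLeaderInitiatedRecoveryZnodePath().

            import org.apache.solr.common.cloud.SolrZkClient;

            public class ClearLirSketch {
              // Removes any leader-initiated-recovery state for the given shard.
              // The path layout below is assumed; the real code gets it from
              // ZkController.getLeaderInitiatedRecoveryZnodePath(collection, shardId).
              static void clearLir(SolrZkClient zkClient, String collection, String shardId) throws Exception {
                String lirPath = "/collections/" + collection + "/leader_initiated_recovery/" + shardId;
                if (zkClient.exists(lirPath, true)) {
                  zkClient.clean(lirPath); // Recursively deletes the LIR znodes for this shard.
                }
              }
            }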

          Ishan Chattopadhyaya added a comment -

          Adding a wait for recoveries to finish after the recovery operation in the test.

          Ishan Chattopadhyaya added a comment -

          Adding more logging for the tests, and tests asserting that indexing fails during the down state and works again after the recover operation.

          Shalin Shekhar Mangar added a comment -

          Thanks Ishan! A few comments:

          1. nit - RecoverShardTest has an unused notLeader1 variable
          2. Shouldn't the "Wait for a long time for a steady state" piece of code be before the proxies for the two replicas are reopened? The LIR state will surely be set at indexing time and only if the proxy is closed. Also if you move that wait before the proxy is reopened then you are sure to have the LIR state as 'down'.
          3. The check for 'numActiveReplicas' and 'numReplicasOnLiveNodes' should be done after force refreshing the cluster state of the cloudClient otherwise spurious failures can happen
          4. nit - Why is sendDoc overridden in RecoverShardTest? The minRf is same, just the max retries has been increased and wait between retries has been decreased
          5. The OCMH.recoverShard() isn't unsetting the leader properly. It should be as simple as:
            ZkNodeProps m = new ZkNodeProps(Overseer.QUEUE_OPERATION, OverseerAction.LEADER.toLower(),
                ZkStateReader.SHARD_ID_PROP, shardId, ZkStateReader.COLLECTION_PROP, collection);
            Overseer.getInQueue(zkClient).offer(Utils.toJSON(m));
            
          6. Can you please write a test to ensure that this API works with 'async' parameter?

          I think some simple scenarios are not being taken care of. This command only helps if a LIR node exists, but we can do a bit more:

          1. Leader is live but 'down' -> mark it 'active'
          2. Leader itself is in LIR -> delete the LIR node
          3. Leader is not live:
            1. Replicas are live but 'down' or 'recovering' -> mark them 'active'
            2. Replicas are live but in LIR -> delete the LIR nodes

          Can you please add some tests exercising each of the above scenarios?

          I also tried to mark just one of the replicas as active instead of all the replicas, hoping it will become leader and others would recover from it. However, this resulted in one of the other down replicas becoming leader but still staying down. Looking into why that could be happening; bug?

          Did you find out why/how that happened? If this is reproducible, can you please create an issue and post the test there?

          Ishan Chattopadhyaya added a comment -

          1. nit - RecoverShardTest has an unused notLeader1 variable

          Thanks. I refactored the test a bit and this variable has gone away now.

          2. Shouldn't the "Wait for a long time for a steady state" piece of code be before the proxies for the two replicas are reopened? The LIR state will surely be set at indexing time and only if the proxy is closed. Also if you move that wait before the proxy is reopened then you are sure to have the LIR state as 'down'.

          This makes sense, I've made the change.

          3. The check for 'numActiveReplicas' and 'numReplicasOnLiveNodes' should be done after force refreshing the cluster state of the cloudClient otherwise spurious failures can happen

          I didn't know about this force update of the cluster state; I've now added it.

          4. nit - Why is sendDoc overridden in RecoverShardTest? The minRf is same, just the max retries has been increased and wait between retries has been decreased

          The tests were (and still are) taking too long, and reducing the wait from 30sec to 1sec was helpful.

          5. The OCMH.recoverShard() isn't unsetting the leader properly. It should be as simple as:

          Thanks, I've cleaned this up.

          6. Can you please write a test to ensure that this API works with 'async' parameter?

          TODO.

          Leader is live but 'down' -> mark it 'active'

          This works now. Added testLeaderDown() method.

          Leader itself is in LIR -> delete the LIR node

          This should work, since the API method first clears the LIR state. Couldn't add a test for this, since I couldn't simulate this state in a test.

          Leader is not live: Replicas are live but 'down' or 'recovering' -> mark them 'active'

          This works now. Added testAllReplicasDownNoLeader() method.

          Leader is not live: Replicas are live but in LIR -> delete the LIR nodes

          This works as of the last patch. The corresponding test is now testReplicasInLIRNoLeader().

          Did you find out why/how that happened? If this is reproducible, can you please create an issue and post the test there?

          Added SOLR-7989 for this, will look deeper soon.

          Mark Miller added a comment -

          This seems to have some overlap with SOLR-6236 based on the comments. Tim did some work here as well:

          At a high-level, the issue boils down to giving SolrCloud operators a way to either a) manually force a leader to be elected, or b) set an optional configuration property that triggers the force leader behavior after seeing so many failed recoveries due to no leader. So this can be considered an optional availability-over-consistency mode with respect to leader-failover.

          Ishan Chattopadhyaya added a comment -

          At a high-level, the issue boils down to giving SolrCloud operators a way to either a) manually force a leader to be elected, or b) set an optional configuration property that triggers the force leader behavior after seeing so many failed recoveries due to no leader.

          I think the difference is that in this issue, we're just trying to (manually) clean up the LIR state and mark affected down replicas as active and hope that normal leader election is initiated and normalcy is restored. Based on initial glance at SOLR-6236, it seems that the intention there is to actually force one of the replicas to become a leader (either manually or automatically).

          I am not sure which path we should take, but it seems the approach taken here is less intrusive/safer, if/when it works.

          Mark Miller added a comment -

          Timothy Potter, what is your impression?

          Mark Miller added a comment -

          we're just trying to (manually) clean up the LIR state and mark affected down replicas as active and hope that normal leader election is initiated and normalcy is restored.

          I like that approach too, but I want to make sure we consider SOLR-6236.

          Ishan Chattopadhyaya added a comment -

          This seems to have some overlap with SOLR-6236 based on the comments.

          I think I should've posted the patches there, instead of here. Seems like I'm trying to solve the same problem here (albeit, in a different way).

          Timothy Potter added a comment -

          Hi, will dig into this in detail later today, sorry for the delay (been on another project) ...

          Ishan Chattopadhyaya added a comment -
          • Passing the async parameter through,
          • Tests now randomly make async requests for the recover shard API call.
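
          For reference, a sketch of the async submit-and-poll pattern the tests exercise is shown below. The action name, collection, shard, and request id are placeholders (the action is later renamed FORCELEADER in this thread), and this is illustrative rather than the exact test code.

            import java.net.HttpURLConnection;
            import java.net.URL;

            public class AsyncRecoverShardSketch {
              public static void main(String[] args) throws Exception {
                String base = "http://localhost:8983/solr/admin/collections";
                // Submit the request asynchronously; the call returns immediately and the
                // work is tracked under the given request id.
                String submit = base + "?action=RECOVERSHARD&collection=mycollection&shard=shard1&async=recover-1&wt=json";
                // Poll REQUESTSTATUS until the async task reports that it has completed or failed.
                String status = base + "?action=REQUESTSTATUS&requestid=recover-1&wt=json";

                HttpURLConnection submitConn = (HttpURLConnection) new URL(submit).openConnection();
                System.out.println("submit -> HTTP " + submitConn.getResponseCode());

                HttpURLConnection statusConn = (HttpURLConnection) new URL(status).openConnection();
                System.out.println("status -> HTTP " + statusConn.getResponseCode());
              }
            }
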
          Timothy Potter added a comment -

          Looks good Ishan. Sorry for the delay getting a review done. In putNonLeadersIntoLIR, you probably want to wait a little bit before killing the leader after sending doc #2 to give the leader time to put the replicas into LIR; this works quickly on our local workstations but can take a little more time on Jenkins.

          I'm also wondering if you should bring the original downed leader back into the mix (the one that got killed in the putNonLeadersIntoLIR method) in the testReplicasInLIRNoLeader test after the new leader is selected and see what state it comes back to. Also, try sending another doc #5 once the Jetty hosting the original leader is back online.

          Lastly, what happened to the idea of allowing the user to pick the leader as part of the recover shard request? I read the comments above and agree that just triggering a re-election is preferred, but sometimes us humans actually know which replica is best. It seems reasonable to me to accept an optional parameter that specifies the replica that should be selected. However, if others don't like that idea, then I'm fine with this for now.

          Mark Miller added a comment -

          what happened to the idea of allowing the user to pick the leader as part of the recover shard request?

          As long as it's optional and documented so that users understand the risks, it's probably okay. But I think the system will beat most users, in most cases, in understanding who should really be the leader.

          Ishan Chattopadhyaya added a comment -

          Timothy Potter Thanks for your review. I've added an extra wait before killing the leader. I tried to bring back a killed leader, but it seems not to be working. I tried many things, but was stuck with "address already in use" exception, perhaps due to the socket proxy holding on to the port.

          Ishan Chattopadhyaya added a comment -

          what happened to the idea of allowing the user to pick the leader as part of the recover shard request?

          I had a look at your patch for SOLR-6236, I think we can tackle this using that approach. At this point, I'm inclined to keep this patch at this and tackle it separately. Most likely, the system will pick a reasonable leader, and it will sync with other replicas and the shard will be restored.

          Mark Miller added a comment -

          I wonder if Recover is the right terminology. It seems so broad and "fix anything" like. Perhaps it should be something close to 'forceleader' - something that is specific about what is happening and gives an idea that you are overriding the system as you are.

          Ishan Chattopadhyaya added a comment -

          I had the same dilemma while naming this. Recover does seem like it will fix things if anything is broken, which can be misleading since at this time we aren't doing anything other than helping fix the LIR state to bring the shard back up.
          On the other hand, I am not sure about force leader, because we aren't really forcing a leader, but just paving things for an election to happen. I'm really not totally sure either way.

          How about keeping this as recover shard, documenting this as an advanced API which can potentially cause data loss, and then later add whatever else we need to recover the system from to this API itself?

          Mark Miller added a comment -

          That sounds reasonable as long as we have good doc warning about it.

          Mark Miller added a comment -

          because we aren't really forcing a leader, but just paving things for an election to happen.

          I guess it comes down to how you want to think about it. When you use this, it will be because the system is blocking a leader from taking over. By running this API command, you remove the blocks, thus 'forcing' a leader the system would not normally pick - or at least attempting to force a leader the system would not really pick. It depends on if you want to get bogged down in implementation or design.

          I think your proposal is fine though.

          Timothy Potter added a comment -

          +1 on FORCE_LEADER for the name of the action.

          was stuck with "address already in use" exception

          You should have access to the SocketProxy if you need to close it down before trying to restart the original leader's Jetty. If not, we should fix that.

          I'm inclined to keep this patch at this and tackle it separately

          sounds good ... may not ever be needed in the wild with the solution you've created here

          Ishan Chattopadhyaya added a comment -

          When you use this, it will be because the system is blocking a leader from taking over. By running this API command, you remove the blocks, thus 'forcing' a leader the system would not normally pick - or at least attempting to force a leader the system would not really pick.

          That makes sense! I've renamed this to FORCELEADER now.
          Also managed to reincarnate the killed off leaders into the tests.

          Shalin Shekhar Mangar added a comment -

          Thanks Ishan.

          1. ForceLeaderTest.testReplicasInLIRNoLeader has a 5 second sleep, why? Isn't waitForRecoveriesToFinish() enough?
          2. Similarly, ForceLeaderTest.testLeaderDown has a 15 second sleep for steady state to be reached? What is this steady state, is there a better way than waiting for an arbitrary amount of time? In general, Thread.sleep should be avoided as much as possible as a way to reach steady state.
          3. Can you please add some javadocs on the various test methods describing the scenario that they test?
          4. minor nit - can you use assertEquals when testing equality of state etc instead of assertTrue. The advantage with assertEquals is that it logs the mismatched values in the exception messages.
          5. In OverseerCollectionMessageHandler, lirPath can never be null. The lir path should probably be logged in debug rather than INFO.
            // Clear out any LIR state
            String lirPath = overseer.getZkController().getLeaderInitiatedRecoveryZnodePath(collection, sliceId);
            if (lirPath != null && zkStateReader.getZkClient().exists(lirPath, true)) {
              StringBuilder sb = new StringBuilder();
              zkStateReader.getZkClient().printLayout(lirPath, 4, sb);
              log.info("Cleaning out LIR data, which was: " + sb);
              zkStateReader.getZkClient().clean(lirPath);
            }
            
          6. There's no need to send an empty string as the role while publishing the state of the replica.
          7. minor nit - you can compare enums directly using == instead of .equals
          8. Referring to the following, what is the thinking behind it? when can this happen? is there a test which specifically exercises this scenario? seems like this can interfere with the leader election if the leader election was taking some time?
            // If we still don't have an active leader by now, it may be possible that the replica at the head of the election queue
            // was the leader at some point and never left the queue, but got marked as down. So, if the election queue is not empty,
            // and the replica at the head of the queue is live, then mark it as a leader.
            
          Ishan Chattopadhyaya added a comment -

          Thanks Shalin for looking into the patch and your review.

          ForceLeaderTest.testReplicasInLIRNoLeader has a 5 second sleep, why? Isn't waitForRecoveriesToFinish() enough?

          Fixed. This was a left over from some previous patch. I think I wanted to put the waitForRecoveriesToFinish(), but forgot to remove the 5 second sleep.

          Similarly, ForceLeaderTest.testLeaderDown has a 15 second sleep for steady state to be reached? What is this steady state, is there a better way than waiting for an arbitrary amount of time? In general, Thread.sleep should be avoided as much as possible as a way to reach steady state.

          In this case, waiting those 15 seconds results in one of the down replicas becoming a leader (but staying down). This is the situation I'm using FORCELEADER to recover from. Instead of waiting 15 seconds, I've added some polling with wait to wake up earlier if needed, while increasing the timeout from 15s to 25s.
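
          (For context, a generic sketch of the poll-with-timeout pattern described above; this is not the actual test code, and the names are illustrative.)

            import java.util.concurrent.TimeUnit;
            import java.util.function.Supplier;

            public class PollUntil {
              // Polls the condition until it returns true or the timeout elapses,
              // instead of sleeping for a fixed, worst-case amount of time.
              static boolean waitFor(Supplier<Boolean> condition, long timeout, TimeUnit unit)
                  throws InterruptedException {
                long deadline = System.nanoTime() + unit.toNanos(timeout);
                while (System.nanoTime() < deadline) {
                  if (condition.get()) {
                    return true;     // Steady state reached; no need to wait out the full timeout.
                  }
                  Thread.sleep(250); // Poll interval; small enough to wake up promptly.
                }
                return condition.get(); // One final check at the deadline.
              }
            }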

          Can you please add some javadocs on the various test methods describing the scenario that they test?

          Sure, added.

          minor nit - can you use assertEquals when testing equality of state etc instead of assertTrue. The advantage with assertEquals is that it logs the mismatched values in the exception messages.

          Used assertEquals() now.

          In OverseerCollectionMessageHandler, lirPath can never be null. The lir path should probably be logged in debug rather than INFO.

          Thanks for the pointer, I've removed the null check. I feel this should be INFO instead of DEBUG, so that if a user says they issued FORCELEADER but still nothing worked, their logs would help us understand whether there was any LIR state which was cleared out. But, please feel free to remove it if this doesn't make sense.

          minor nit - you can compare enums directly using == instead of .equals

          Fixed.

          Referring to the following, what is the thinking behind it? when can this happen? is there a test which specifically exercises this scenario? seems like this can interfere with the leader election if the leader election was taking some time?

          I modified the comment text to make it more clear. This is for the situation when all replicas are (somehow, due to a bug maybe?) down/recovering (but not in LIR), and there is no leader, even though many replicas are on live nodes; I don't know if this ever happens (the LIR case happens, I know). The testAllReplicasDownNoLeader test exercises this scenario. This is more or less the scenario that you described (with the one difference that there is no leader as well): Leader is not live: Replicas are live but 'down' or 'recovering' -> mark them 'active'.

          As you point out, I think it can indeed interfere with any on-going leader election; my thought was that this FORCELEADER call is issued only because the leader election isn't achieving a stable leader, so force marking the queue head replica as leader is okay. But I defer to your judgement if this is fine or not, and I can remove (or you feel free to remove) that code path from the patch if you feel it is not right.

          Ishan Chattopadhyaya added a comment -

          Based on an offline conversation with Shalin (and the discussion above), I've removed that extra handling of the situation where:

          1. there is no LIR involved
          2. all replicas are down
          3. there is no leader.

          This involved force marking the replica at the election queue head as a leader, which might have other unintended consequences. Hopefully, this situation never occurs in the real world. If it does, then we can tackle this in a separate issue.

          The following situation is still taken care of:

          1. there is no LIR involved
          2. all replicas are down

          Shalin Shekhar Mangar please review the changes. Thanks.

          Shalin Shekhar Mangar added a comment -

          Thanks Ishan but I think you missed the test in your latest patch? Its size has decreased from 36kb to 8kb.

          Mark Miller added a comment -

          // Marking all live nodes as active.

          Why do we do this manually like this? Shouldn't we allow this to happen naturally?

          Mark Miller added a comment -

          It seems like what we really want is to make sure the last published state for each replica does not prevent it from becoming the leader?

          Ishan Chattopadhyaya added a comment -

          Ah, missed out the test in my last patch. Here it is.

          Shalin Shekhar Mangar added a comment -

          It seems like what we really want is to make sure the last published state for each replica does not prevent it from becoming the leader?

          Do you mean that removing blockers like LIR is enough?

          Ishan Chattopadhyaya added a comment -

          It seems like what we really want is to make sure the last published state for each replica does not prevent it from becoming the leader?

          It seems to me that there's no easy way to set the last published state of a replica without the replicas doing it themselves. Do you think we should be doing that instead of marking them as active? Or do you think that just clearing the LIR is enough?

          Mark Miller added a comment -

          There are two main things I think that prevent a replica from becoming a leader: its last published state on the CloudDescriptor not being ACTIVE, and LIR. I thought we would want to clear LIR and perhaps add an ADMIN command that will set the last published state on the CloudDescriptor to ACTIVE for each replica.

          Ishan Chattopadhyaya added a comment -

          Updated patch. It now depends on SOLR-8233 and SOLR-7989.

          After removing the part that was doing the force marking of down live replicas as active, the recovery from a situation where all replicas were down (due to LIR) was not working. The reason was that a down replica, when elected as leader, never gets marked as active. Fixed that in SOLR-7989.

          Mark Miller, Shalin Shekhar Mangar Please review this (as well as SOLR-8233 and SOLR-7989). Also, can you please edit this issue to depend on those two issues (I can't edit, since Shalin created the issue).

          Noble Paul added a comment -

          Let's not keep the core admin command as OVERRIDELASTPUBLISHED. This means it can be a generic enough API which may be abused by others for other things. Let's not tell others what we are doing internally and keep the command name opaque

          This particular collection admin operation does not really have to go to overseer, it can be performed by the receiving node itself because the clearing of LIR node does not have to be done at overseer anyway

          Ishan Chattopadhyaya added a comment - edited

          Thanks for your review, Noble Paul.

          Let's not keep the core admin command as OVERRIDELASTPUBLISHED. This means it can be a generic enough API which may be abused by others for other things. Let's not tell others what we are doing internally and keep the command name opaque

          This patch uses FORCEPREPAREFORLEADERSHIP from SOLR-8233. Does this sound fine?

          This particular collection admin operation does not really have to go to overseer, it can be performed by the receiving node itself because the clearing of LIR node does not have to be done at overseer anyway

          The reason I wanted to keep it at the Overseer was that most cluster management code is there. I can move this to CollectionsHandler instead of OCMH.

          Ishan Chattopadhyaya added a comment -

          One downside of not having something like OVERRIDELASTPUBLISHED is that, in the test, I couldn't set the last published state to DOWN and check that it was set back to ACTIVE by FORCELEADER. In this updated patch with FORCEPREPAREFORLEADERSHIP, the test has no easy way of setting the last published state to DOWN before the API command is called. Not a deal breaker, but just putting it out there. I'm personally fine either way (or with another name that is more suitable).

          Ishan Chattopadhyaya added a comment -

          This particular collection admin operation does not really have to go to the Overseer; it can be performed by the receiving node itself, because clearing the LIR node does not have to be done at the Overseer anyway.

          Here is a patch that adds the API command (FORCELEADER) to the CollectionsHandler instead of the OCMH. I couldn't find a way to make this ASYNC here, which I could do at OCMH; did I miss something? Does this look fine? (Noble Paul?)

          I somehow feel doing it in CollectionsHandler is a bit misplaced, and would rather do it at OCMH. But I am fine either way, so long as we do it; both patches are there.

          Note: As with the previous patch (that puts the meat into the OCMH), this patch depends on prior application of patches in SOLR-8233 and SOLR-7989.

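          For illustration, the resulting FORCELEADER command would be invoked against the Collections API roughly as in the sketch below. The host, collection, and shard names are placeholders, and the class is a throwaway example using plain HttpURLConnection rather than anything from the patch.

          import java.net.HttpURLConnection;
          import java.net.URL;
          import java.util.Scanner;

          public class ForceLeaderExample {
            public static void main(String[] args) throws Exception {
              // Placeholder host, collection and shard names -- adjust for your cluster.
              URL url = new URL("http://localhost:8983/solr/admin/collections"
                  + "?action=FORCELEADER&collection=mycollection&shard=shard1");
              HttpURLConnection conn = (HttpURLConnection) url.openConnection();
              // Print the raw response so the outcome of the command is visible.
              try (Scanner in = new Scanner(conn.getInputStream(), "UTF-8")) {
                while (in.hasNextLine()) {
                  System.out.println(in.nextLine());
                }
              }
            }
          }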
          Mark Miller added a comment -

          I'm kind of split on where it should go. For something simple and brute force like this, CollectionsHandler is probably fine. Either way seems ok.

          I wouldn't really worry about it being async if it stays in CollectionsHandler.

          ASF subversion and git services added a comment -

          Commit 1712851 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1712851 ]

          SOLR-7569: A collection API called FORCELEADER when all replicas in a shard are down

          ASF subversion and git services added a comment -

          Commit 1712852 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1712852 ]

          SOLR-7569: changed message

          ASF subversion and git services added a comment -

          Commit 1712854 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1712854 ]

          SOLR-7569: A collection API called FORCELEADER when all replicas in a shard are down

          Ishan Chattopadhyaya added a comment -

          There was a failure on Windows in one of the tests:
          http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/5385/testReport/org.apache.solr.cloud/ForceLeaderTest/testReplicasInLIRNoLeader/

          I'd like to get this change in, please (SOLR-7569-testfix.patch). Sorry for the trouble.

          ASF subversion and git services added a comment -

          Commit 1713898 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1713898 ]

          SOLR-7989, SOLR-7569: Ignore this test.

          ASF subversion and git services added a comment -

          Commit 1713899 from Mark Miller in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1713899 ]

          SOLR-7989, SOLR-7569: Ignore this test.

          Mark Miller added a comment -

          This feature and its test count on behavior that was added but was incorrect. See SOLR-7989. I've disabled the test for now.

          Mark Miller added a comment -

          A better approach is probably for this API to deal with a DOWN but valid leader itself. It should only ever happen due to manually screwing up LIR, and if this API is messing with LIR, it should also fix the ramifications.

          Perhaps the last thing the API should do is run through each shard and see if the registered leader is DOWN, and if it is, make it ACTIVE (preferably by asking it to publish itself as ACTIVE; we don't want to publish for someone else). If the call waits around to make sure all the leaders come up, this should be simple.

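          A minimal Java sketch of the final sweep Mark describes, using SolrJ's cluster-state classes: it finds registered leaders that are still DOWN although their node is live. The askToPublishActive(...) helper is purely hypothetical; a real implementation would send a request to the leader's own node so it publishes its state itself.

          import org.apache.solr.common.cloud.ClusterState;
          import org.apache.solr.common.cloud.DocCollection;
          import org.apache.solr.common.cloud.Replica;
          import org.apache.solr.common.cloud.Slice;

          public class DownLeaderSweep {
            // Find registered leaders that are DOWN but whose node is live,
            // and ask each one to publish itself as ACTIVE.
            static void sweep(ClusterState clusterState, String collectionName) {
              DocCollection coll = clusterState.getCollection(collectionName);
              for (Slice slice : coll.getSlices()) {
                Replica leader = slice.getLeader();
                if (leader != null
                    && leader.getState() == Replica.State.DOWN
                    && clusterState.getLiveNodes().contains(leader.getNodeName())) {
                  askToPublishActive(leader);
                }
              }
            }

            // Hypothetical placeholder: in a real implementation this would issue a request
            // to the leader's node so that it publishes its own state as ACTIVE, rather than
            // someone else publishing on its behalf.
            static void askToPublishActive(Replica leader) {
              throw new UnsupportedOperationException("illustrative placeholder");
            }
          }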
          Mark Miller added a comment -

          It should only ever happen due to manually screwing up LIR, and if this API is messing with LIR

          Down the road though, we will want to solve this for SOLR-7034 and SOLR-7065.

          I've taken a crack at making SOLR-7989 work.

          Ishan Chattopadhyaya added a comment - - edited

          I've taken a crack at making SOLR-7989 work.

          Thanks!

          Perhaps the last thing the API should do is run through each shard and see if the registered leader is DOWN, and if it is, make it ACTIVE (preferably by asking it to publish itself as ACTIVE; we don't want to publish for someone else). If the call waits around to make sure all the leaders come up, this should be simple.

          This makes sense. I think this is something that Shalin alluded to (please excuse me if I'm mistaken) when he said: "1. Leader is live but 'down' -> mark it 'active'". Having the replicas mark themselves ACTIVE, instead of someone else marking them ACTIVE, seems like a good thing to do.

          Ishan Chattopadhyaya added a comment -

          Can we close this now, and create new JIRAs for future enhancements? Mark Miller, Shalin Shekhar Mangar, Noble Paul?

          ASF subversion and git services added a comment -

          Commit 1714842 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1714842 ]

          SOLR-7569 test failure fix

          ASF subversion and git services added a comment -

          Commit 1714844 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1714844 ]

          SOLR-7569 test failure fix

          Mike Drob added a comment -

          Can we close this now, and create new JIRAs for future enhancements? Mark Miller, Shalin Shekhar Mangar, Noble Paul?

          I agree with this.

          Mark Miller added a comment -

          This was only reopened because the test was ignored due to reverting SOLR-7989. With that resolved, this should be fine.


            People

            • Assignee:
              Noble Paul
              Reporter:
              Shalin Shekhar Mangar
            • Votes:
              5
              Watchers:
              14
