SOLR-3721: Multiple concurrent recoveries of same shard?

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: multicore, SolrCloud
    • Environment:

      Using our own Solr release based on Apache revision 1355667 from the 4.x branch. Our changes to the Solr version are our solutions to TLT-3178 etc., and should have no effect on this issue.

      Description

      We run a performance/endurance test on a seven-instance SolrCloud setup, and eventually the Solr instances lose their ZK connections and go into recovery. Incidentally, the recovery often never succeeds, but we are looking into that separately. While doing so I noticed that, according to the logs, multiple recoveries are in progress at the same time for the same shard. That cannot be intended, and I can certainly imagine it causing problems.
      Is it just the logs that are wrong, did I make some mistake, or is this a real bug?

      See the attached grep of the logs, matching only the "Finished recovery" and "Starting recovery" lines.

      grep -B 1 "Finished recovery\|Starting recovery" solr9.log solr8.log solr7.log solr6.log solr5.log solr4.log solr3.log solr2.log solr1.log solr0.log > recovery_start_finish.log
      

      It can be hard to get an overview of the log, so I have generated a graph showing (based solely on the "Starting recovery" and "Finished recovery" lines) how many recoveries are in progress at any time for each shard; a sketch of this counting approach follows below. See the attached recovery_in_progress.png. The graph is also a little hard to read (due to the many shards), but it is clear that for several shards multiple recoveries are going on at the same time, and that several recoveries never succeed.

      Regards, Per Steffensen
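
      A sketch of the counting approach behind the graph, assuming the shard name can somehow be parsed out of each log line; the parsing is left as a placeholder because it depends on the exact log format.

      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Paths;
      import java.util.HashMap;
      import java.util.Map;

      // Counts recoveries in progress per shard from the grepped log.
      // extractShardName is a placeholder; fill it in for the actual log format.
      public class RecoveryCounter {
          public static void main(String[] args) throws IOException {
              Map<String, Integer> inProgress = new HashMap<>();
              for (String line : Files.readAllLines(Paths.get("recovery_start_finish.log"))) {
                  String shard = extractShardName(line);
                  if (shard == null) {
                      continue; // separator lines, unrelated lines, etc.
                  }
                  if (line.contains("Starting recovery")) {
                      inProgress.merge(shard, 1, Integer::sum);
                  } else if (line.contains("Finished recovery")) {
                      inProgress.merge(shard, -1, Integer::sum);
                  }
                  // Any value above 1 here means overlapping recoveries for that shard.
                  System.out.println(shard + " -> " + inProgress.getOrDefault(shard, 0));
              }
          }

          private static String extractShardName(String line) {
              return null; // placeholder: depends on the Solr log layout in use
          }
      }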

      1. recovery_in_progress.png
        72 kB
        Per Steffensen
      2. recovery_start_finish.log
        35 kB
        Per Steffensen

        Activity

        Mark Miller added a comment -

        Correct, only one recovery should run at a time. I'll try and look into this.

        Per Steffensen added a comment - edited

        What if, e.g., you lose the ZK connection while in recovery? Will it then start another recovery without checking whether one is already going on?

        Mark Miller added a comment -

        It should not, no. Each core has a recovery lock that should only allow one recovery to happen at a time. Any further recovery attempts should line up and proceed one at a time. That's not to say there is no bug here - but that's the intention.

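        To make the per-core serialization concrete, here is a minimal sketch using a plain ReentrantLock; the RecoveryCoordinator class is invented for the example and is not Solr's actual recovery code.

        import java.util.concurrent.locks.ReentrantLock;

        // Hypothetical illustration of "one recovery per core at a time";
        // not the actual Solr implementation.
        public class RecoveryCoordinator {
            // One lock per core instance; further recovery requests queue up here.
            private final ReentrantLock recoveryLock = new ReentrantLock();

            public void doRecovery(Runnable recoveryWork) {
                recoveryLock.lock();       // blocks until any in-progress recovery finishes
                try {
                    recoveryWork.run();    // only one recovery runs inside the lock
                } finally {
                    recoveryLock.unlock(); // the next queued attempt, if any, proceeds
                }
            }
        }
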
        Per Steffensen added a comment -

        What if two Solrs, respectively running the leader and the replica for the same slice (only one replica), lose their ZK connections at about the same time? Then there will be no active shard that either of them can recover from. Could it be that in such scenarios multiple concurrent recoveries of the same shard somehow get started?

        BTW, the scenario above shouldn't end in a situation where the slice is just dead. The two shards in the same slice ought to find out which has the newest version of the shard data (probably the one that was leader last), make that shard the leader (without recovering) and let the other shard recover from it. Is this scenario handled (in the way I suggest or in another way) already in Solr 4.0 (beta - tip of branch), or is that a future thing (e.g. for 4.1 or 5.0)?

        Regards, Per Steffensen

        Mark Miller added a comment -

        What if two Solrs, respectively running the leader and the replica for the same slice (only one replica), lose their ZK connections at about the same time? Then there will be no active shard that either of them can recover from. Could it be that in such scenarios multiple concurrent recoveries of the same shard somehow get started?

        No recovery will be started if a node cannot talk to ZooKeeper. So nothing would happen until one or both of the nodes reconnected to ZooKeeper. That would trigger a leader election; the leader node would attempt to sync up with all the other nodes for that shard, and any recoveries would proceed against it.

        There is a lock for each core that only allows one recovery per core to happen at a time. I'm not saying there is no bug in this, but that is the intention.

        BTW, the scenario above shouldn't end in a situation where the slice is just dead. The two shards in the same slice ought to find out which has the newest version of the shard data (probably the one that was leader last), make that shard the leader (without recovering) and let the other shard recover from it. Is this scenario handled (in the way I suggest or in another way) already in Solr 4.0 (beta - tip of branch), or is that a future thing (e.g. for 4.1 or 5.0)?

        It happens as I mention above. A little more detail on the "leader attempts to sync up":

        When a new node is elected as leader via ZooKeeper, it first tries to do a peer sync against every other live node. So let's say the first node in your two-node situation comes back and it is behind the other node, but it comes back first and is elected leader. The second node has the latest updates, but is second in line to be leader and a few updates ahead. The potential leader will try to peer sync with the other node and get those missing updates if there are fewer than 100, or fail because the other node is ahead by too much. If the peer sync fails, the potential leader will give up its leader role, realizing there seems to be a better candidate. The other node, being next in line to be leader, will now try to peer sync with the other nodes in the shard. In this case that will succeed, since it is ahead of the first node. It will then ask the other nodes to peer sync to it. If they are fewer than 100 docs behind, that will succeed. If any sync-back attempts fail, the leader asks them to recover and they will replicate. Only after this sync process is completed does the leader advertise in the cloud state that it is now the leader.

        That is the current process - we will continually be hardening and improving it I'm sure.

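        A rough outline of the flow described above, with placeholder names (LeaderCandidate, peerSyncWith, requestSyncBack, requestRecovery, publishAsLeader) that are not Solr's actual API; it is only meant to illustrate the ordering.

        import java.util.List;

        // Illustrative outline of the "leader attempts to sync up" flow; names are placeholders.
        abstract class LeaderCandidate {
            interface Replica {}

            abstract boolean peerSyncWith(Replica other);    // pull missing updates, or fail if too far behind
            abstract boolean requestSyncBack(Replica other); // ask the other replica to sync to us
            abstract void requestRecovery(Replica other);    // ask the other replica to replicate from us
            abstract void publishAsLeader();                 // advertise leadership in the cloud state

            boolean tryToBecomeLeader(List<Replica> otherLiveReplicas) {
                // 1. Peer sync against every other live replica; give up if one is too far ahead.
                for (Replica other : otherLiveReplicas) {
                    if (!peerSyncWith(other)) {
                        return false; // a better candidate exists; let it take over
                    }
                }
                // 2. Ask the others to sync back; if that fails, ask them to recover (replicate).
                for (Replica other : otherLiveReplicas) {
                    if (!requestSyncBack(other)) {
                        requestRecovery(other);
                    }
                }
                // 3. Only after this sync process does the node advertise itself as leader.
                publishAsLeader();
                return true;
            }
        }
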
        Per Steffensen added a comment - edited

        No recovery will be started if a node cannot talk to ZooKeeper.

        Well, I knew that. I meant that the two Solrs were disconnected from ZK at the same time, but of course both got their connections re-established - after the session timeout (I believe, and kind of hope, that a session timeout has to have happened before Solr needs to go into recovery after a ZK connection loss).

        When it gets prioritized on my side, I will try to investigate further what causes the log to claim that many recoveries go on for the same shard concurrently.

        When a new node is elected as leader via ZooKeeper, it first tries to do a peer sync against every other live node. So let's say the first node in your two-node situation comes back and it is behind the other node, but it comes back first and is elected leader. The second node has the latest updates, but is second in line to be leader and a few updates ahead. The potential leader will try to peer sync with the other node and get those missing updates if there are fewer than 100, or fail because the other node is ahead by too much.

        Well, we shouldn't let this issue (SOLR-3721) become about many other issues, but when the "behind" node has reconnected and become leader and the one with the latest updates does not come back live right away, isn't the new leader (which is behind) allowed to start handling update requests? If yes, then it is possible that both shards have documents/updates that the other one doesn't, and it is possible to come up with scenarios where there is no good algorithm for generating the "correct" merged union of the data in both shards. So what do we do when the other shard (which used to have a later version than the current leader) comes live? I believe there is nothing solid to do!

        How to avoid that? I was thinking about keeping the latest version for every slice in ZK, so that a "behind" shard will know whether it has the latest version of a slice, and therefore whether it is allowed to take the role of leader. Of course, the writing of this "latest version" to ZK and the writing of the corresponding update in the leader's transaction log would have to be as atomic as possible (like the A in ACID). And it would be nice if the writing of the update in the replica's transaction log were also atomic with the leader write and the ZK write, in order to increase the chance that a replica is actually allowed to take over the leader role if the leader dies (or both die and the replica comes back first). But all that is just an idea off the top of my head.
        Do you already have a solution implemented, or one on the drawing board, or how do you/we prevent such a problem? As far as I understand "the drill" during leader election/recovery (whether it's peer sync or file-copy replication) from the little code reading I have done and from what you explain, there is no current solution. But I might be wrong?

        The other node, being next in line to be leader, will now try to peer sync with the other nodes in the shard.

        I guess/hope you mean "...with the other shards (running on different nodes) in the slice". As I understand Solr terminology, a logical chunk of the "entire data" (a collection in Solr) is a "slice", and the data in a slice might physically exist in more than one place (in multiple shards, if replication is used). Back when I started taking an interest in Solr I spent a considerable amount of time understanding Solr terminology - mainly because it is different from what I have been used to (in my pre-Solr world a "shard" is what you call a "slice") - so now please don't tell me that I misunderstood.

        That is the current process - we will continually be hardening and improving it I'm sure.

        I will probably stick around for that. The correctness and robustness of this live-replication feature is (currently) very important to us.

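        To make the idea concrete (this is not something Solr does), a sketch of how such a "latest version per slice" znode could be written with ZooKeeper's conditional setData; the znode path and class name are invented for the example.

        import java.nio.charset.StandardCharsets;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooKeeper;
        import org.apache.zookeeper.data.Stat;

        // Illustration only of the "keep the latest slice version in ZK" idea above;
        // the znode path and layout are invented for the example.
        class SliceVersionUpdater {
            static void recordLatestVersion(ZooKeeper zk, String slice, long updateVersion)
                    throws KeeperException, InterruptedException {
                String path = "/collections/mycollection/slice_versions/" + slice; // hypothetical path
                Stat stat = new Stat();
                zk.getData(path, false, stat); // read current value just to learn its znode version
                byte[] data = Long.toString(updateVersion).getBytes(StandardCharsets.UTF_8);
                // Conditional write: throws BadVersionException if someone else updated the
                // znode in between, so two leaders cannot silently overwrite each other.
                zk.setData(path, data, stat.getVersion());
            }
        }
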
        Mark Miller added a comment -

        Back when I started taking an interest in Solr I spent a considerable amount of time understanding Solr terminology - mainly because it is different from what I have been used to (in my pre-Solr world a "shard" is what you call a "slice") - so now please don't tell me that I misunderstood.

        We never had agreement on this. It ended up that we use "slice" in the code, and "shard" means one thing or the other depending on context. Can it be confusing? I think so.

        Mark Miller added a comment -

        I believe there is nothing solid to do!

        Well, we can do some practical things, right? I don't think we need to support a node coming back from the dead a year later with some updates the cluster doesn't have. A node coming up 2 minutes later is something we want to worry about, though.

        So basically we need something either timing-based or admin-command-based that lets you start a cold shard (slice), where each node waits around for X amount of time or until command X is received, and then leader election begins.

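        A sketch of what such a timing-based gate could look like; the names (ColdStartGate, waitBeforeElection, expectedReplicas) are made up for illustration, and nothing like this exists in Solr at the time of this discussion.

        import java.util.concurrent.TimeUnit;
        import java.util.function.IntSupplier;

        // Hypothetical "cold start" gate: wait until the expected number of replicas for the
        // slice are up, or a timeout passes, before leader election begins.
        class ColdStartGate {
            static void waitBeforeElection(IntSupplier liveReplicaCount,
                                           int expectedReplicas,
                                           long timeoutMillis) throws InterruptedException {
                long deadline = System.currentTimeMillis() + timeoutMillis;
                while (liveReplicaCount.getAsInt() < expectedReplicas
                        && System.currentTimeMillis() < deadline) {
                    TimeUnit.MILLISECONDS.sleep(500); // poll until everyone shows up or we give up
                }
                // Proceed with leader election either way; with all replicas present, the one
                // with the newest data can win rather than whoever happened to come up first.
            }
        }
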
        Jan Høydahl added a comment -

        ElasticSearch has some settings to control when recovery starts after a cluster restart; see its guide. This approach looks reasonable. If we know that we expect N nodes in our cluster, we can start recovery when we see N nodes up. If fewer than N nodes are up, we wait for X time (running on local data, not accepting new updates) before recovery and leader election start.

        Mark Miller added a comment -

        I also forgot to mention: not only is there the lock that only allows one recovery to proceed at a time, but every recovery cancels any in-progress recovery before it starts - this does a Thread#join on the existing recovery thread. Then a new recovery thread starts.

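        A stripped-down sketch of that cancel-then-join pattern; RecoveryRunner and its fields are invented for the illustration and do not reflect Solr's actual recovery classes.

        // Simplified illustration of "cancel any in-progress recovery, join it, start a new one".
        class RecoveryRunner {
            private volatile boolean cancelled;
            private Thread recoveryThread;

            synchronized void startRecovery(Runnable recoveryWork) throws InterruptedException {
                if (recoveryThread != null && recoveryThread.isAlive()) {
                    cancelled = true;      // the running recovery checks this flag and stops
                    recoveryThread.join(); // if this join is interrupted, a new recovery could
                                           // briefly overlap the old one (see the later comment)
                }
                cancelled = false;
                recoveryThread = new Thread(recoveryWork, "RecoveryThread");
                recoveryThread.start();
            }

            boolean isCancelled() {
                return cancelled;
            }
        }
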
        Per Steffensen added a comment -

        I took the parts of the comments on this issue that are really not about "multiple concurrent recoveries of same shard", but basically more about "Avoid losing data on ZK connection-loss/session-timeout", copy-pasted them into a mail with the subject "Avoid losing data on ZK connection-loss/session-timeout" and sent it to dev@lucene.apache.org. Let's continue the discussion there.

        Mark Miller added a comment -

        Connection loss should not be involved here - the general move for connection loss is to retry, not go into any recoveries, until you hit a session timeout. (There is an exception in the leader election algorithm where you don't want to retry, but that is another story.)

        (Which you may know, but for clarification to those following along.)

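        For readers following along, a minimal sketch of that distinction at the ZooKeeper client level using the standard Watcher API; this is generic ZooKeeper usage, not Solr's actual connection-handling code.

        import org.apache.zookeeper.WatchedEvent;
        import org.apache.zookeeper.Watcher;

        // Generic illustration of connection loss vs. session expiry; not Solr's code.
        class ConnectionStateWatcher implements Watcher {
            @Override
            public void process(WatchedEvent event) {
                switch (event.getState()) {
                    case Disconnected:
                        // Connection loss: the session may still be alive on the server,
                        // so just retry/wait for reconnect; do not start recovery.
                        break;
                    case Expired:
                        // Session timeout: ephemeral nodes are gone, a new session is
                        // needed, and this is where recovery comes into play.
                        break;
                    case SyncConnected:
                        // Reconnected within the session timeout; nothing special needed.
                        break;
                    default:
                        break;
                }
            }
        }
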
        Mark Miller added a comment -

        I've been reviewing this code, and so far there is only one way I can see that would seem to allow multiple recoveries at once - if there is an interrupt while cancelRecovery is doing a join, it seems another recovery could be started and briefly overlap. Since we already call close on the recovery thread, the overlap would be brief at best. An interrupt like this should be somewhat rare - but interrupts do happen on Jetty shutdown. I'm going to guess this is not what you were seeing, but I'll plug it.

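        One way to close that window is to keep waiting even if the join is interrupted and restore the interrupt flag afterwards; this is just an illustration, not necessarily the fix that was actually committed.

        // Illustrative helper: wait for the old recovery thread even across interrupts,
        // so a new recovery cannot start while the old one is still winding down.
        final class UninterruptibleJoin {
            static void join(Thread recoveryThread) {
                boolean interrupted = false;
                while (recoveryThread.isAlive()) {
                    try {
                        recoveryThread.join();
                    } catch (InterruptedException e) {
                        interrupted = true; // remember the interrupt, but keep waiting
                    }
                }
                if (interrupted) {
                    Thread.currentThread().interrupt(); // restore the interrupt status
                }
            }
        }
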
        Mark Miller added a comment -

        I've committed the fix to the issue I mention in the above comment.

        Per Steffensen added a comment -

        Cool. I don't believe this was our issue, as you also mention. But any fix of a potential problem is a step in the right direction.

        It will probably be a while before I get back to this issue, because we have changed our strategy. Before, we were aiming at including replication in the first production-ready version of our application, but we have had too many problems with the immature (IMHO) new way of doing replication in 4.0. Now we aim at not including replication, so I will not be dealing with replication much for the next few months.

        Hoss Man added a comment -

        Assigning to Mark since it sounds like he is actively working on this.

        Mark Miller added a comment -

        Is this still an issue you see?


          People

          • Assignee: Unassigned
          • Reporter: Per Steffensen
          • Votes: 0
          • Watchers: 5