SOLR-6530: Commits under network partition can put any node in down state

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10.2, 5.0, 6.0
    • Component/s: SolrCloud
    • Labels: None

      Description

      Commits are executed by any node in SolrCloud, i.e. they are not routed via the leader like other updates.

      1. Suppose there is one collection with one shard and two replicas (A and B), and A is the leader.
      2. A commit request is made to node B while B cannot talk to A due to a partition for any reason (failing switch, heavy GC, whatever).
      3. B fails to distribute the commit to A (times out) and asks A to recover.
      4. This used to be harmless because a leader simply ignored recovery requests, but with the leader-initiated recovery (LIR) code, B puts A in the "down" state and A can never get out of that state.

      tl;dr: During network partitions, if enough commit/optimize requests are sent to the cluster, all the nodes in the cluster will eventually be marked as "down".
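      The following is a minimal SolrJ sketch of step 2, showing how a commit can land on an arbitrary replica rather than the leader. The host, port, and collection name are hypothetical, and the 6.x-style client builder varies across SolrJ versions:

      {code:java}
      import org.apache.solr.client.solrj.SolrClient;
      import org.apache.solr.client.solrj.impl.HttpSolrClient;

      public class CommitToReplica {
          public static void main(String[] args) throws Exception {
              // Hypothetical base URL pointing at replica B's core, not at leader A.
              try (SolrClient replicaB = new HttpSolrClient.Builder(
                      "http://nodeB:8983/solr/collection1").build()) {
                  // B accepts the commit and distributes it to the other replicas itself;
                  // unlike document updates, nothing routes this through the leader first.
                  replicaB.commit();
              }
          }
      }
      {code}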

      Attachments

      1. SOLR-6530.patch
        10 kB
        Shalin Shekhar Mangar
      2. SOLR-6530.patch
        9 kB
        Shalin Shekhar Mangar
      3. SOLR-6530.patch
        10 kB
        Shalin Shekhar Mangar
      4. SOLR-6530.patch
        9 kB
        Shalin Shekhar Mangar
      5. SOLR-6530.patch
        14 kB
        Shalin Shekhar Mangar
      6. SOLR-6530.patch
        8 kB
        Shalin Shekhar Mangar
      7. SOLR-6530.patch
        8 kB
        Shalin Shekhar Mangar
      8. SOLR-6530.patch
        6 kB
        Shalin Shekhar Mangar

          Activity

          Shalin Shekhar Mangar added a comment -

          A trivial test which demonstrates the problem by partitioning the leader from a replica and sending a commit to the replica which then marks the leader as "down".
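          The actual test is attached above; below is only a conceptual sketch of its shape using Solr's SocketProxy test helper. getProxyForLeader(), replicaBClient, and leaderState() are hypothetical stand-ins for the real test scaffolding:

          {code:java}
          import org.apache.solr.cloud.SocketProxy; // package location varies by version
          import org.junit.Test;

          @Test
          public void commitDuringPartitionMustNotDownTheLeader() throws Exception {
            SocketProxy toLeader = getProxyForLeader(); // hypothetical helper
            toLeader.close();                           // partition replica B from leader A
            try {
              replicaBClient.commit();                  // commit sent directly to B
            } catch (Exception expected) {
              // B times out forwarding the commit to A; it must NOT put A into recovery
            }
            toLeader.reopen();                          // heal the partition
            // the leader must still be live, not marked "down" (hypothetical helper)
            org.junit.Assert.assertNotEquals("down", leaderState());
          }
          {code}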

          Shalin Shekhar Mangar added a comment -

          The fix is trivial. I added checks to make sure that the current core is a leader before LIR code is executed.

          There is an isLeader variable in DUP which I thought about using but it is true by default and not updated for commits. I think the whole leader logic thing is very trappy and it needs some refactoring. I'll try to tackle that later in a separate issue.
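          Roughly, the shape of this first patch (illustrative, not the actual diff) is a leader check in front of the LIR call:

          {code:java}
          import org.apache.solr.cloud.CloudDescriptor;

          // Hypothetical helper: only run leader-initiated recovery if this core
          // believes it is the leader. DUP's own isLeader field is avoided because
          // it defaults to true and is not updated for commits.
          static void maybeInitiateRecovery(CloudDescriptor cloudDesc, Runnable lirAction) {
            if (!cloudDesc.isLeader()) {
              return; // a replica must never put another node into recovery
            }
            lirAction.run();
          }
          {code}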

          Mark Miller added a comment -

          Something like this came up in another JIRA issue as well; I can't remember which. I said it there too, but beyond this improvement, I don't think it makes a lot of sense for someone to ask for a recovery because a commit failed. There should be better / cheaper options.

          Shalin Shekhar Mangar added a comment -

          bq. I don't think it makes a lot of sense for someone to ask for a recovery because a commit failed. There should be better / cheaper options.

          I agree that commits shouldn't force recovery. What do you suggest? Should we specifically disallow commit requests from executing the LIR code?

          Alan Woodward added a comment -

          Think I just came across another version of this:

          1. A is leader, and is distributing docs to B
          2. A gets a large GC pause, during which B takes over as leader
          3. A wakes up again, continues to send docs from DistributedUpdateProcessor, but now gets errors in response because B is refusing the updates (as they're marked as FROMLEADER, and B is now the leader)
          4. In DUP.doFinish(), A finds that it has a bunch of errors from B, and so it attempts to put B into recovery, with the same result as point 4 in the issue description

          Ramkumar Aiyengar added a comment -

          In general, it's leader-initiated recovery, so if I am not the leader, I shouldn't be doing the logic for any operation. That's probably just commit for now since that's not forwarded to the leader, but if there's any other operation in the future which doesn't have to be coordinated by the leader, it could use the same logic?

          Shalin Shekhar Mangar added a comment -

          Here's a better fix which uses the global (ZK) state instead of the local state before executing the LIR code. From my reading of the code, the local isLeader variable in CloudDescriptor is not unset in all cases.
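          A hedged sketch of what "use the global (ZK) state" might look like; class and method names vary across Solr versions, and this is not the literal patch:

          {code:java}
          import org.apache.solr.cloud.CloudDescriptor;
          import org.apache.solr.cloud.ZkController;
          import org.apache.solr.common.cloud.Replica;

          // Consult the cluster state published in ZooKeeper rather than the locally
          // cached isLeader flag before running the LIR code.
          static boolean isLeaderAccordingToZk(ZkController zk, CloudDescriptor cloudDesc,
                                               String collection, String shardId) {
            Replica leader = zk.getZkStateReader().getClusterState()
                .getCollection(collection).getSlice(shardId).getLeader();
            return leader != null && leader.getName().equals(cloudDesc.getCoreNodeName());
          }
          {code}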

          Shalin Shekhar Mangar added a comment -

          Alan Woodward - That sounds right. This fix will help to a great extent but it isn't perfect. I think we may need to add some intelligence to the overseer to reject invalid state transitions in the cluster state. SOLR-6538 can also help in resolving such issues, e.g. a leader that has been put in the down state isn't aware of its published state and will never try to get out of it.

          Ramkumar Aiyengar - Yes, you're right that it could happen. We need to refactor this code so that these rules are properly defined and enforced.

          Shalin Shekhar Mangar added a comment -

          My last fix was not complete. Checking whether I am a leader is not enough because commits are broadcast to the entire collection without regard for shards. So it is still possible that a core which is the leader of shard2 runs the LIR code against a leader or replica of another shard.

          I've added a test case to reproduce this. The fix is again simple - we just don't run recovery for commits at all. A rough sketch of that rule follows.
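          A hypothetical condensation of the rule (method name made up, not the actual patch):

          {code:java}
          import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
          import org.apache.solr.client.solrj.request.UpdateRequest;

          // A failed commit alone must never trigger leader-initiated recovery:
          // commits are broadcast to every shard, so a commit failure says nothing
          // about whether this core even leads the failed replica's shard.
          static boolean mayRecoverAfterFailure(UpdateRequest failedRequest) {
            if (failedRequest.getAction() == AbstractUpdateRequest.ACTION.COMMIT) {
              return false; // log and move on instead of demanding recovery
            }
            return true; // for real updates, the leader-only checks still apply
          }
          {code}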

          Ramkumar Aiyengar added a comment -

          Should LIR then be checked for: I am a leader and I am the leader for replica X which is not accessible?

          Shalin Shekhar Mangar added a comment -

          bq. Should LIR then be checked for: I am a leader and I am the leader for replica X which is not accessible?

          That will be done by the initial fix that I made. But that's not enough for commits because a commit is sent to all shards.

          Shalin Shekhar Mangar added a comment -

          Okay, now I get what you're saying. That's a good idea. I'll fix.

          Shalin Shekhar Mangar added a comment -

          Folding in the change suggested by Ramkumar.

          Shalin Shekhar Mangar added a comment -

          Patch with the right test. I accidentally included an old version of the test with my last patch.

          Shalin Shekhar Mangar added a comment -

          The last patch's test had a bug. It wasn't using the right proxies map. This is fixed now.

          Shalin Shekhar Mangar added a comment -

          Here's a better patch which removes the redundant isLeader check and also logs a warning if the errored node is not in the replica list of the current core's shard. All tests passed. This is ready.
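          Condensing the discussion above, the final rule looks roughly like this (an illustrative sketch with made-up names, not the committed code):

          {code:java}
          import org.apache.solr.common.cloud.Replica;
          import org.apache.solr.common.cloud.Slice;

          // Only ask a node to recover if this core is the leader of the failed
          // replica's own shard.
          static boolean mayInitiateRecovery(Slice myShard, String myCoreNodeName,
                                             String failedCoreNodeName) {
            Replica leader = myShard.getLeader();
            boolean iLeadThisShard =
                leader != null && leader.getName().equals(myCoreNodeName);
            // If the errored node isn't one of my shard's replicas (e.g. a broadcast
            // commit that hit another shard), log a warning and skip LIR entirely.
            boolean failedNodeIsMyReplica = myShard.getReplica(failedCoreNodeName) != null;
            return iLeadThisShard && failedNodeIsMyReplica;
          }
          {code}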

          ASF subversion and git services added a comment -

          Commit 1628945 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1628945 ]

          SOLR-6530: Commits under network partitions can put any node in down state

          ASF subversion and git services added a comment -

          Commit 1629107 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1629107 ]

          SOLR-6530: Reopen the socket proxy after test finishes

          ASF subversion and git services added a comment -

          Commit 1629108 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1629108 ]

          SOLR-6530: Commits under network partitions can put any node in down state

          ASF subversion and git services added a comment -

          Commit 1632236 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1632236 ]

          SOLR-6530: Commits under network partitions can put any node in down state

          Shalin Shekhar Mangar added a comment -

          This is fixed. Thanks everyone!

          ASF subversion and git services added a comment -

          Commit 1633276 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1633276 ]

          SOLR-6530: Protect against NPE when there are no live replicas

          ASF subversion and git services added a comment -

          Commit 1633277 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1633277 ]

          SOLR-6530: Protect against NPE when there are no live replicas

          ASF subversion and git services added a comment -

          Commit 1633278 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1633278 ]

          SOLR-6530: Protect against NPE when there are no live replicas


            People

            • Assignee: Shalin Shekhar Mangar
            • Reporter: Shalin Shekhar Mangar
            • Votes: 0
            • Watchers: 10
