Solr
  1. Solr
  2. SOLR-7989

After a new leader is elected it, it should ensure it's state is ACTIVE if it has already registered with ZK.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      It is possible that a down replica gets elected as a leader, and that it stays down after the election.

      Here's how I hit upon this:

      • There are 3 replicas: leader, notleader0, notleader1
      • Introduced network partition to isolate notleader0, notleader1 from leader (leader puts these two in LIR via zk).
      • Kill leader, remove partition. Now leader is dead, and both of notleader0 and notleader1 are down. There is no leader.
      • Remove LIR znodes in zk.
      • Wait a while, and there happens a (flawed?) leader election.
      • Finally, the state is such that one of notleader0 or notleader1 (which were down before) become leader, but stays down.
      1. DownLeaderTest.java
        7 kB
        Ishan Chattopadhyaya
      2. DownLeaderTest.java
        7 kB
        Ishan Chattopadhyaya
      3. SOLR-7989.patch
        7 kB
        Mark Miller
      4. SOLR-7989.patch
        4 kB
        Mark Miller
      5. SOLR-7989.patch
        2 kB
        Ishan Chattopadhyaya
      6. SOLR-7989.patch
        3 kB
        Ishan Chattopadhyaya
      7. SOLR-7989.patch
        3 kB
        Ishan Chattopadhyaya
      8. SOLR-7989.patch
        2 kB
        Ishan Chattopadhyaya
      9. SOLR-7989.patch
        2 kB
        Ishan Chattopadhyaya
      10. SOLR-8233.patch
        5 kB
        Ishan Chattopadhyaya

        Issue Links

          Activity

          Hide
          Ishan Chattopadhyaya added a comment - - edited

          Finally understood why this is happening.

          When a DOWN replica is elected a leader, the ElectionContext's runLeaderProcess() finally tries to publish the new leader as well as the new state (now ACTIVE) in the same message. For example:

          {
            "operation":"leader",
            "shard":"shard1",
            "collection":"forceleader_test_collection",
            "base_url":"http://127.0.0.1:36501/sz_sie",
            "core":"forceleader_test_collection_shard1_replica2",
            "state":"active"}
          

          However, the OverseerAction for LEADER operation doesn't actually update the "state" that was passed in. So, although the replica gets elected as the leader, its state stays DOWN in the cluster state. Example (immediately after the leader election in above example):

          shard1:{
            "range":"80000000-7fffffff",
            "state":"active",
            "replicas":{
              "core_node1":{
                "core":"forceleader_test_collection_shard1_replica1",
                "base_url":"http://127.0.0.1:44845/sz_sie",
                "node_name":"127.0.0.1:44845_sz_sie",
                "state":"down"},
              "core_node2":{
                "state":"down",
                "base_url":"http://127.0.0.1:36501/sz_sie",
                "core":"forceleader_test_collection_shard1_replica2",
                "node_name":"127.0.0.1:36501_sz_sie",
                "leader":"true"},
              "core_node3":{
                "state":"down",
                "base_url":"http://127.0.0.1:33088/sz_sie",
                "core":"forceleader_test_collection_shard1_replica3",
                "node_name":"127.0.0.1:33088_sz_sie"}}}
          

          I'll raise a patch for this soon.

          Show
          Ishan Chattopadhyaya added a comment - - edited Finally understood why this is happening. When a DOWN replica is elected a leader, the ElectionContext's runLeaderProcess() finally tries to publish the new leader as well as the new state (now ACTIVE) in the same message. For example: { "operation":"leader", "shard":"shard1", "collection":"forceleader_test_collection", "base_url":"http://127.0.0.1:36501/sz_sie", "core":"forceleader_test_collection_shard1_replica2", "state":"active"} However, the OverseerAction for LEADER operation doesn't actually update the "state" that was passed in. So, although the replica gets elected as the leader, its state stays DOWN in the cluster state. Example (immediately after the leader election in above example): shard1:{ "range":"80000000-7fffffff", "state":"active", "replicas":{ "core_node1":{ "core":"forceleader_test_collection_shard1_replica1", "base_url":"http://127.0.0.1:44845/sz_sie", "node_name":"127.0.0.1:44845_sz_sie", "state":"down"}, "core_node2":{ "state":"down", "base_url":"http://127.0.0.1:36501/sz_sie", "core":"forceleader_test_collection_shard1_replica2", "node_name":"127.0.0.1:36501_sz_sie", "leader":"true"}, "core_node3":{ "state":"down", "base_url":"http://127.0.0.1:33088/sz_sie", "core":"forceleader_test_collection_shard1_replica3", "node_name":"127.0.0.1:33088_sz_sie"}}} I'll raise a patch for this soon.
          Hide
          Ishan Chattopadhyaya added a comment -

          Patch for fixing this problem.
          Split out the current overseer LEADER operation message into two messages, one for the LEADER and one for the STATE.

          Show
          Ishan Chattopadhyaya added a comment - Patch for fixing this problem. Split out the current overseer LEADER operation message into two messages, one for the LEADER and one for the STATE.
          Hide
          Ishan Chattopadhyaya added a comment - - edited

          The DownLeaderTest passes after this fix, fails without it.

          I'm adding the test separate from the patch for fix, since this test relies on a very specific scenario to hit this bug. However, there might be a more generic way of simulating the same scenario, just that I can't think of any simpler way of simulating in a unit test.

          Show
          Ishan Chattopadhyaya added a comment - - edited The DownLeaderTest passes after this fix, fails without it. I'm adding the test separate from the patch for fix, since this test relies on a very specific scenario to hit this bug. However, there might be a more generic way of simulating the same scenario, just that I can't think of any simpler way of simulating in a unit test.
          Hide
          Noble Paul added a comment -

          Send the state=ACTIVE first before the LEADER msg
          Let's check if the node is actually active before sending the state=ACTIVE message

          Show
          Noble Paul added a comment - Send the state=ACTIVE first before the LEADER msg Let's check if the node is actually active before sending the state=ACTIVE message
          Hide
          Ishan Chattopadhyaya added a comment -

          Thanks Noble, makes sense. Changed the order of the two messages.

          Show
          Ishan Chattopadhyaya added a comment - Thanks Noble, makes sense. Changed the order of the two messages.
          Hide
          Ishan Chattopadhyaya added a comment -

          Forgot the check for current state in the last patch. Added it now.

          Show
          Ishan Chattopadhyaya added a comment - Forgot the check for current state in the last patch. Added it now.
          Hide
          Ishan Chattopadhyaya added a comment -

          This check for state, and more precisely the use of zkStateReader to update/access the cluster state, is causing a hang/stall in `LeaderElectionTest`. I'm looking into the reason / fix.

          Show
          Ishan Chattopadhyaya added a comment - This check for state, and more precisely the use of zkStateReader to update/access the cluster state, is causing a hang/stall in `LeaderElectionTest`. I'm looking into the reason / fix.
          Hide
          Ishan Chattopadhyaya added a comment -

          From that test, LeaderElectionTest, the clusterstate was obtained as null. Added a check for that, so now the test passes smoothly as before.

          Attached the updated patch, running full suite of tests now.

          Show
          Ishan Chattopadhyaya added a comment - From that test, LeaderElectionTest , the clusterstate was obtained as null. Added a check for that, so now the test passes smoothly as before. Attached the updated patch, running full suite of tests now.
          Hide
          Mark Miller added a comment -

          Great catch Ishan!

          Here's how I hit upon this:

          Can we catch this in a unit test using the proxy jetties to simulate the partitions?

          Show
          Mark Miller added a comment - Great catch Ishan! Here's how I hit upon this: Can we catch this in a unit test using the proxy jetties to simulate the partitions?
          Hide
          Mark Miller added a comment -

          Nevermind, I see a new test is broken out on it's own, was just looking at the patch files. Awesome.

          Show
          Mark Miller added a comment - Nevermind, I see a new test is broken out on it's own, was just looking at the patch files. Awesome.
          Hide
          Noble Paul added a comment -

          That is a unit test. Never has any clusterstate . It only tests for leader election

          Show
          Noble Paul added a comment - That is a unit test. Never has any clusterstate . It only tests for leader election
          Hide
          Ishan Chattopadhyaya added a comment -

          Can we catch this in a unit test using the proxy jetties to simulate the partitions?

          Right, there's the DownLeaderTest. However, I don't think we should commit the test. The test is quite a complex way of proving that there is a problem, and that the problem is fixed once this patch goes in. But, it relies on LIR logic, and if LIR code changes later, the test will have to keep up with that, which is maintenance work. I couldn't find an easier way (apart from LIR) to trigger this particular situation (down replica becoming leader, staying down).

          Show
          Ishan Chattopadhyaya added a comment - Can we catch this in a unit test using the proxy jetties to simulate the partitions? Right, there's the DownLeaderTest. However, I don't think we should commit the test. The test is quite a complex way of proving that there is a problem, and that the problem is fixed once this patch goes in. But, it relies on LIR logic, and if LIR code changes later, the test will have to keep up with that, which is maintenance work. I couldn't find an easier way (apart from LIR) to trigger this particular situation (down replica becoming leader, staying down).
          Hide
          Mark Miller added a comment -

          I'd still lean toward trying to get that test in, even if just in some nightly form. I think these specific tests are great. And I'm hoping LIR is not going to have to change too much. As recovery code hardens, it should actually continually change less and less is my hope.

          Up to you though, I get the current concern with the test. But while the test is specific, it doesn't seem too terribly complicated. And we are going to want to add more and more specific tests over time. Perhaps we should do some work to make a test like this more possible with built in support somehow?

          Show
          Mark Miller added a comment - I'd still lean toward trying to get that test in, even if just in some nightly form. I think these specific tests are great. And I'm hoping LIR is not going to have to change too much. As recovery code hardens, it should actually continually change less and less is my hope. Up to you though, I get the current concern with the test. But while the test is specific, it doesn't seem too terribly complicated. And we are going to want to add more and more specific tests over time. Perhaps we should do some work to make a test like this more possible with built in support somehow?
          Hide
          Erick Erickson added a comment -

          How about just a comment in the test about "don't waste too much time making this test pass if LIR code changes" or something?

          Show
          Erick Erickson added a comment - How about just a comment in the test about "don't waste too much time making this test pass if LIR code changes" or something?
          Hide
          Ishan Chattopadhyaya added a comment -

          Attached the updated patch, running full suite of tests now.

          Ran many times now, some test or the other fails. Neither of them reproducible. I think this is good to go in.

          I just realized that I've written up a very similar test for SOLR-7569 (ForceLeaderTest). One of the tests in that suite does exactly what the DownLeaderTest is doing here. As with the test here, that test too fails without this patch in. Only difference is that in that test, the call to FORCELEADER is clearing the LIR state, whereas the DownLeaderTest here is clearing the LIR state directly. I think we should ignore the test here and when SOLR-7569 goes in, we'll have a test that covers the code path taken in this patch.

          Noble Paul, Mark Miller Can you please review and commit this? Thanks.

          Show
          Ishan Chattopadhyaya added a comment - Attached the updated patch, running full suite of tests now. Ran many times now, some test or the other fails. Neither of them reproducible. I think this is good to go in. I just realized that I've written up a very similar test for SOLR-7569 (ForceLeaderTest). One of the tests in that suite does exactly what the DownLeaderTest is doing here. As with the test here, that test too fails without this patch in. Only difference is that in that test, the call to FORCELEADER is clearing the LIR state, whereas the DownLeaderTest here is clearing the LIR state directly. I think we should ignore the test here and when SOLR-7569 goes in, we'll have a test that covers the code path taken in this patch. Noble Paul , Mark Miller Can you please review and commit this? Thanks.
          Hide
          ASF subversion and git services added a comment -

          Commit 1712753 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1712753 ]

          SOLR-7989: After a new leader is elected it should change it's state to ACTIVE even
          if the last published state is something else

          Show
          ASF subversion and git services added a comment - Commit 1712753 from Noble Paul in branch 'dev/trunk' [ https://svn.apache.org/r1712753 ] SOLR-7989 : After a new leader is elected it should change it's state to ACTIVE even if the last published state is something else
          Hide
          ASF subversion and git services added a comment -

          Commit 1712781 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1712781 ]

          SOLR-7989: After a new leader is elected it should change it's state to ACTIVE even
          if the last published state is something else

          Show
          ASF subversion and git services added a comment - Commit 1712781 from Noble Paul in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1712781 ] SOLR-7989 : After a new leader is elected it should change it's state to ACTIVE even if the last published state is something else
          Hide
          Ishan Chattopadhyaya added a comment -

          Thanks, Noble!

          Show
          Ishan Chattopadhyaya added a comment - Thanks, Noble!
          Hide
          Mark Miller added a comment -

          I think I was just hitting this pretty consistently while working on a new ChaosMonkey type test after fixing SOLR-8244.

          Had a live and down replica at the end that would never recover. Updated to this commit and it no longer happens.

          Show
          Mark Miller added a comment - I think I was just hitting this pretty consistently while working on a new ChaosMonkey type test after fixing SOLR-8244 . Had a live and down replica at the end that would never recover. Updated to this commit and it no longer happens.
          Hide
          Mark Miller added a comment -

          Why do we use the stale clusterstate to see if we are already active and prevent publishing active if we are not? Doesn't that just allow for unknown races? Shouldn't we just publish active regardless?

          Show
          Mark Miller added a comment - Why do we use the stale clusterstate to see if we are already active and prevent publishing active if we are not? Doesn't that just allow for unknown races? Shouldn't we just publish active regardless?
          Hide
          Ishan Chattopadhyaya added a comment -

          Shouldn't we just publish active regardless?

          That's what I wanted to do in my initial patch. Though, upon's Noble's comment to add the check, I thought it would help reduce one overseer message and be more efficient.

          Why do we use the stale clusterstate to see if we are already active and prevent publishing active if we are not?

          What do you think we should do, do you suggest (1) we force update the cluster state before the check so that we don't check against stale clusterstate, or (2) send the active state message regardless?

          Attaching the patch for (1), this required a change to the LeaderElectionTest.

          To do (2), it would require a change to OverseerTest.testOverseerStatsReset (SOLR-8249), and I don't currently know how to make it work if the STATE=ACTIVE message is sent regardless. If that's the way you suggest we should go, maybe I could raise a patch to send the message without a state check and disable the test for now.

          Show
          Ishan Chattopadhyaya added a comment - Shouldn't we just publish active regardless? That's what I wanted to do in my initial patch. Though, upon's Noble's comment to add the check, I thought it would help reduce one overseer message and be more efficient. Why do we use the stale clusterstate to see if we are already active and prevent publishing active if we are not? What do you think we should do, do you suggest (1) we force update the cluster state before the check so that we don't check against stale clusterstate, or (2) send the active state message regardless? Attaching the patch for (1), this required a change to the LeaderElectionTest. To do (2), it would require a change to OverseerTest.testOverseerStatsReset ( SOLR-8249 ), and I don't currently know how to make it work if the STATE=ACTIVE message is sent regardless. If that's the way you suggest we should go, maybe I could raise a patch to send the message without a state check and disable the test for now.
          Hide
          Noble Paul added a comment -

          99/100 times the state=ACTIVE

          you can do a force read before checking the state .

          Show
          Noble Paul added a comment - 99/100 times the state=ACTIVE you can do a force read before checking the state .
          Hide
          Ishan Chattopadhyaya added a comment -

          you can do a force read before checking the state .

          That's what I think I've done in the last patch. Can you/Mark please review?
          Sorry for missing this out; I did this update cluster state in a patch before the original commit itself; but the LeaderElectionTest failure threw me off and I somehow missed the update cluster state in the final patch for the commit.

          Show
          Ishan Chattopadhyaya added a comment - you can do a force read before checking the state . That's what I think I've done in the last patch. Can you/Mark please review? Sorry for missing this out; I did this update cluster state in a patch before the original commit itself; but the LeaderElectionTest failure threw me off and I somehow missed the update cluster state in the final patch for the commit.
          Hide
          Mark Miller added a comment -

          I actually think this is wrong on other levels.

          All publishing of state should go through ZkController - we do things like track the last published state. I'm not sure this is the right place to do this either. I'm reviewing this in another issue.

          Show
          Mark Miller added a comment - I actually think this is wrong on other levels. All publishing of state should go through ZkController - we do things like track the last published state. I'm not sure this is the right place to do this either. I'm reviewing this in another issue.
          Hide
          Mark Miller added a comment -

          Also, what about basic stuff like:

          Core starts up, starts election, it's not active. Becomes leader, it's not active so it's made active with this change.

          Now register is called. Register has the core do tlog replay and then is ready to go active. So it does. But we are already active too early?

          Show
          Mark Miller added a comment - Also, what about basic stuff like: Core starts up, starts election, it's not active. Becomes leader, it's not active so it's made active with this change. Now register is called. Register has the core do tlog replay and then is ready to go active. So it does. But we are already active too early?
          Hide
          Mark Miller added a comment -

          There is def an issue to be fixed here, but I think it may be more complicated than this.

          Show
          Mark Miller added a comment - There is def an issue to be fixed here, but I think it may be more complicated than this.
          Hide
          Mark Miller added a comment -

          Hmm...doesn't this test create the problem itself?

          It puts all bunch of replicas in LIR manually, causing the DOWN state. Then it manually clears LIR and tries to see who will become leader.

          This is not a normal case?

          In any case, I will revert the current fix. Now I think the test may be creating the problem itself, but regardless, this change really breaks things.

          Show
          Mark Miller added a comment - Hmm...doesn't this test create the problem itself? It puts all bunch of replicas in LIR manually, causing the DOWN state. Then it manually clears LIR and tries to see who will become leader. This is not a normal case? In any case, I will revert the current fix. Now I think the test may be creating the problem itself, but regardless, this change really breaks things.
          Hide
          ASF subversion and git services added a comment -

          Commit 1713882 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1713882 ]

          SOLR-7989: Revert changes.

          Show
          ASF subversion and git services added a comment - Commit 1713882 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1713882 ] SOLR-7989 : Revert changes.
          Hide
          ASF subversion and git services added a comment -

          Commit 1713883 from Mark Miller in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1713883 ]

          SOLR-7989: Revert changes.

          Show
          ASF subversion and git services added a comment - Commit 1713883 from Mark Miller in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1713883 ] SOLR-7989 : Revert changes.
          Hide
          Mark Miller added a comment -

          Ishan Chattopadhyaya, I don't think this is a valid test at all.

          Why does it clear LIR while Solr is running? That is not a valid real world scenario.

          Show
          Mark Miller added a comment - Ishan Chattopadhyaya , I don't think this is a valid test at all. Why does it clear LIR while Solr is running? That is not a valid real world scenario.
          Hide
          Mark Miller added a comment -

          Why does it clear LIR while Solr is running?

          I guess this tests our emergency force a leader API? That's part of why that API is so janky. It's not really valid to simply clear out LIR nodes on a running system. And it's a really hard problem to make it doable - this approach certainly won't work.

          What we really need to do is fix things so it's not needed.

          Show
          Mark Miller added a comment - Why does it clear LIR while Solr is running? I guess this tests our emergency force a leader API? That's part of why that API is so janky. It's not really valid to simply clear out LIR nodes on a running system. And it's a really hard problem to make it doable - this approach certainly won't work. What we really need to do is fix things so it's not needed.
          Hide
          ASF subversion and git services added a comment -

          Commit 1713898 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1713898 ]

          SOLR-7989, SOLR-7569: Ignore this test.

          Show
          ASF subversion and git services added a comment - Commit 1713898 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1713898 ] SOLR-7989 , SOLR-7569 : Ignore this test.
          Hide
          ASF subversion and git services added a comment -

          Commit 1713899 from Mark Miller in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1713899 ]

          SOLR-7989, SOLR-7569: Ignore this test.

          Show
          ASF subversion and git services added a comment - Commit 1713899 from Mark Miller in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1713899 ] SOLR-7989 , SOLR-7569 : Ignore this test.
          Hide
          Mark Miller added a comment -

          Something like this might work.

          Show
          Mark Miller added a comment - Something like this might work.
          Hide
          Mark Miller added a comment -

          New version.

          Added some logging.

          Changed so we won't publish ACTIVE after a leader election win if the current state is RECOVERING. Doesn't seem safe to force a move from RECOVERING to ACTIVE.

          Show
          Mark Miller added a comment - New version. Added some logging. Changed so we won't publish ACTIVE after a leader election win if the current state is RECOVERING. Doesn't seem safe to force a move from RECOVERING to ACTIVE.
          Hide
          Ishan Chattopadhyaya added a comment -

          I don't think this is a valid test at all. Why does it clear LIR while Solr is running? That is not a valid real world scenario.

          Indeed, that's why I didn't include it in the patch. I just wanted to bring out that there is a problem, but then I couldn't replicate it in a test in any other way that represents a real world situation. However, just that fact that there were two messages passed as one (LEADER message had a state param) indicated to me that we should, in theory, split them out.

          +1 to your patch, I hadn't considered the core registration scenario. Also, the change from only DOWN to ACTIVE makes sense, and excluding the RECOVERING to ACTIVE change.

          Show
          Ishan Chattopadhyaya added a comment - I don't think this is a valid test at all. Why does it clear LIR while Solr is running? That is not a valid real world scenario. Indeed, that's why I didn't include it in the patch. I just wanted to bring out that there is a problem, but then I couldn't replicate it in a test in any other way that represents a real world situation. However, just that fact that there were two messages passed as one (LEADER message had a state param) indicated to me that we should, in theory, split them out. +1 to your patch, I hadn't considered the core registration scenario. Also, the change from only DOWN to ACTIVE makes sense, and excluding the RECOVERING to ACTIVE change.
          Hide
          Mark Miller added a comment -

          Indeed, that's why I didn't include it in the patch.

          I think that doing that is fine in a test - we just have to carefully understand what we are testing and what we are fixing. This one was particularly tricky - I did not catch the issue the first few times I looked at it. Many things are impossible to test without some of this artificial hand waving though.

          Show
          Mark Miller added a comment - Indeed, that's why I didn't include it in the patch. I think that doing that is fine in a test - we just have to carefully understand what we are testing and what we are fixing. This one was particularly tricky - I did not catch the issue the first few times I looked at it. Many things are impossible to test without some of this artificial hand waving though.
          Hide
          Mark Miller added a comment -

          Also, something I was not really aware of initially, what you are doing does seem to be a valid test for the force leader API that you added in the other issue.

          Show
          Mark Miller added a comment - Also, something I was not really aware of initially, what you are doing does seem to be a valid test for the force leader API that you added in the other issue.
          Hide
          ASF subversion and git services added a comment -

          Commit 1714211 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1714211 ]

          SOLR-7989: After a new leader is elected it, it should ensure it's state is ACTIVE if it has already registered with ZK.

          Show
          ASF subversion and git services added a comment - Commit 1714211 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1714211 ] SOLR-7989 : After a new leader is elected it, it should ensure it's state is ACTIVE if it has already registered with ZK.
          Hide
          ASF subversion and git services added a comment -

          Commit 1714212 from Mark Miller in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1714212 ]

          SOLR-7989: After a new leader is elected it, it should ensure it's state is ACTIVE if it has already registered with ZK.

          Show
          ASF subversion and git services added a comment - Commit 1714212 from Mark Miller in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1714212 ] SOLR-7989 : After a new leader is elected it, it should ensure it's state is ACTIVE if it has already registered with ZK.

            People

            • Assignee:
              Mark Miller
              Reporter:
              Ishan Chattopadhyaya
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development