Solr / SOLR-8696

Start the Overseer before actions that need the overseer on init and when reconnecting after zk expiration and improve init logic.

    Details

    • Flags:
      Patch

      Description

      ZkController.publishAndWaitForDownStates() occurs before overseer election. That means if there is currently no overseer, there is ironically no one to actually service the down state changes it's waiting on. This particularly affects a single-node cluster such as you might run locally for development.

      Additionally, we're doing an unnecessary ZkStateReader forced refresh on all Overseer operations. This isn't necessary because ZkStateReader keeps itself up to date.
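
      To make the ordering issue concrete, here is a minimal sketch of the init sequence in question. The method names mirror the ones discussed in this issue, but the class and bodies are placeholders, not the actual ZkController code:

        // Minimal sketch only: placeholder methods stand in for the real ZkController logic.
        public class InitOrderSketch {

          void buggyOrder() throws Exception {
            publishAndWaitForDownStates(); // can hang on a single node: no Overseer has been elected yet
            joinOverseerElection();        // election happens too late to service the DOWN states above
          }

          void fixedOrder() throws Exception {
            joinOverseerElection();        // an Overseer (possibly this node) now exists ...
            publishAndWaitForDownStates(); // ... so the DOWN state changes it waits on get processed
          }

          // Placeholders for illustration.
          void joinOverseerElection() {}
          void publishAndWaitForDownStates() {}
        }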

      1. SOLR-8696.patch
        4 kB
        Mark Miller
      2. SOLR-8696.patch
        3 kB
        Scott Blum
      3. SOLR-8696-followup.patch
        2 kB
        Scott Blum

        Issue Links

          Activity

          dragonsinth Scott Blum added a comment -

          Shalin Shekhar Mangar
          markrmiller@gmail.com Mark Miller added a comment -

          In one of the leader election improvement patches, I think I took the wait out altogether. Starting the Overseer election first makes sense though.

          Additionally, we're doing an unnecessary ZkStateReader forced refresh on all Overseer operations. This isn't necessary because ZkStateReader keeps itself up to date.

          It keeps itself up to date, but with some delay. Are we sure we don't need to know we have the latest state at that point? What about the case when we just took over for another Overseer? No races?

          dragonsinth Scott Blum added a comment -

          It keeps itself up to date, but with some delay. Are we sure we don't need to know we have the latest state at that point? What about the case when we just took over for another Overseer? No races?

          Forcing a refresh for that case (just took over) seems totally reasonable to me.

          dragonsinth Scott Blum added a comment -

          Actually on second thought, to become the new Overseer, you would have had to observe the ephemeral node disappear for the previous overseer. If you've observed that, and subsequently updated yourself as overseer leader, you would have had to observe all previous cluster state changes committed by the previous Overseer. So I think it's actually fine in that case.

          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks Scott.

          The ZkStateReader refresh is a safeguard to ensure that, before processing any operation, the effects of previous operations are reflected in the local cluster state. This is less important now because most operations do wait for the cluster state to reflect their results (e.g. collection creation/deletion).

          I audited all overseer operations and the following operations do not have such a wait loop:

          1. ADDREPLICA does not wait for the replica to show up in the local cluster state (which may cause the same core name to be assigned again for a subsequent addReplica operation)
          2. ADDREPLICAPROP
          3. DELETEREPLICAPROP
          4. BALANCESHARDUNIQUE
          5. MODIFYCOLLECTION

          Of all these, only ADDREPLICA should be fixed. I don't foresee any damage from the others.
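
          (For illustration, such a wait loop could look roughly like the sketch below. ClusterStateView and getReplicaNames are placeholders for whatever exposes the local cluster state, not the real ZkStateReader API.)

            import java.util.Set;
            import java.util.concurrent.TimeUnit;

            // Illustrative sketch: poll a local cluster-state view until the new replica appears,
            // or give up after a timeout. The types here are placeholders, not Solr classes.
            interface ClusterStateView {
              Set<String> getReplicaNames(String collection, String slice);
            }

            class WaitForReplicaSketch {
              static boolean waitForReplica(ClusterStateView view, String collection, String slice,
                                            String coreName, long timeoutMs) throws InterruptedException {
                long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
                while (System.nanoTime() < deadline) {
                  if (view.getReplicaNames(collection, slice).contains(coreName)) {
                    return true; // the ADDREPLICA result is now visible locally
                  }
                  Thread.sleep(100); // brief back-off before re-checking
                }
                return false; // caller decides whether to fail the operation or just warn
              }
            }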

          dragonsinth Scott Blum added a comment -

          Great! Would you like me to include a fix for ADDREPLICA in this patch? Seems like a good thing to have regardless.

          shalinmangar Shalin Shekhar Mangar added a comment -

          That'd be great, thanks!

          dragonsinth Scott Blum added a comment -

          Shalin Shekhar Mangar Am I crazy or doesn't the call to waitToSeeReplicasInState() do what you're talking about? Loop until the new replica appears in the local cluster state?

          I tried to find a way to step through, but I don't actually see an AddReplicaTest....

          shalinmangar Shalin Shekhar Mangar added a comment -

          Yeah but that only happens for clusters configured with legacyCloud=false.

          To give you some context: the default is legacyCloud=true, which means that any core admin create command can create a corresponding replica in the cluster state. This also means that if you deleted a replica which was down at the time, it may re-create itself in the cluster state once it comes back up. To avoid such problems, a new cluster property called 'legacyCloud' was introduced which requires all cluster state modifications to come from the overseer only. The idea was that this "Zookeeper as truth" mode would eventually become the default, but we haven't made much progress since.
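
          (For reference, this property is toggled through the Collections API CLUSTERPROP action, e.g. /admin/collections?action=CLUSTERPROP&name=legacyCloud&val=false.)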

          dragonsinth Scott Blum added a comment -

          Okay, I can't figure this out after digging for quite a while. I need more help with this! I can't figure out how the cluster state is getting updated in legacy mode. I've tried putting breakpoints in ZkStateWriter, SliceMutator.addReplica, etc., and running the following tests:

          CollectionsAPISolrJTests.testAddAndDeleteReplica
          CollectionsAPIDistributedZkTest.addReplicaTest

          But I'm not getting a hit. What gives?

          markrmiller@gmail.com Mark Miller added a comment -

          I made a run at removing legacy mode before the 6 release, but while it was pretty easy to take out of the non-test code, it requires a really large change to the tests to move away from it.

          markrmiller@gmail.com Mark Miller added a comment -

          But I'm not getting a hit. What gives?

          I'm confused. Does that test even try to run in legacy mode? Can you elaborate a bit? Not sure I fully understand. If I set a break point at SliceMutator.addReplica and run CollectionsAPISolrJTests.testAddAndDeleteReplica, I hit the break point. What other change are you making?

          dragonsinth Scott Blum added a comment - edited

          What git sha are you on? I'm absolutely not hitting that break point running that exact test. I'm at 091889cf79e15909963b6fad6c0a5394a34764bc. The only local changes I have are to comment out the calls to the other tests in that class so that testAddAndDeleteReplica is the only test method run.

          Nor am I seeing this log message in the test logs:

            public ZkWriteCommand addReplica(ClusterState clusterState, ZkNodeProps message) {
              log.info("createReplica() {} ", message);
          
          markrmiller@gmail.com Mark Miller added a comment -

          I was on recent master at the time.

          I'll replicate and report steps and results.

          markrmiller@gmail.com Mark Miller added a comment -

          The only local changes I have are to comment out the calls to the other tests in that class so that testAddAndDeleteReplica is the only test method run.

          Same single change I made. Strange.

          markrmiller@gmail.com Mark Miller added a comment -

          SliceMutator.addReplica

          Ah, got you, sorry. Check out OverseerCollectionMessageHandler#addReplica?

          markrmiller@gmail.com Mark Miller added a comment -

          I'd love to get legacy mode out for 6, but I doubt I'll have the time to tackle it myself. The code change is pretty easy, the test framework ramifications are quite large.

          SOLR-8256: Remove cluster setting 'legacy cloud' in 6x.

          dragonsinth Scott Blum added a comment -

          Mark Miller

          Yes, I get the breakpoint in OverseerCollectionMessageHandler#addReplica. I'm trying to follow up on Shalin Shekhar Mangar's request to add a wait loop at the end of that method. But where I'm struggling is, I don't actually understand where and how the ZK cluster state ever gets updated in legacy mode. Who updates the ZK cluster state as a result of OverseerCollectionMessageHandler#addReplica?

          markrmiller@gmail.com Mark Miller added a comment -

          addReplica issues a core create command to a node. When the core is created, it registers with ZK and you have the new replica. Just doing a state update on registration will get it into the cluster state.

          markrmiller@gmail.com Mark Miller added a comment -

          The history of legacy mode:

          Initially, we had no collections API. Collections were created by a SolrCore registering with ZK and doing state updates. This allowed you to pre-configure SolrCores in solr.xml as you could with non-SolrCloud Solr. It was basically a way to get going quickly, but it has many problems.

          Eventually we got the collections API and, for back-compat reasons, added the legacyCloud property. When not in legacy mode, state updates will not create collections: you must explicitly use the collections API. It's the start of what we call "zookeeper as truth", though currently only a small part of that is implemented.

          markrmiller@gmail.com Mark Miller added a comment -

          I think we should break up these two changes. Can we fix the nasty overseer init on reconnect bug here and open another to continue on the explicit state update?

          markrmiller@gmail.com Mark Miller added a comment -

          New patch. It fixes the same overseer situation on ZK reconnect (I've seen logging that shows this problem is real, and kind of ugly, there too) and removes the forced refresh change.

          markrmiller@gmail.com Mark Miller added a comment -

          Need to remove the duplicate call to zkStateReader.createClusterStateWatchersAndUpdate();

          dragonsinth Scott Blum added a comment - edited

          Splitting it up sounds like a great idea.

          dragonsinth Scott Blum added a comment - https://issues.apache.org/jira/browse/SOLR-8722
          dragonsinth Scott Blum added a comment -

          Patch LGTM. On an unrelated note, I've never understood why it's so important to register all cores as down on startup, since presumably we're about to bring them all up.

          markrmiller@gmail.com Mark Miller added a comment -

          The stale state could say ACTIVE. No guarantee shutdown publishes DOWN (a crash won't).

          The one in init, I'm less sure of. I think Timothy Potter may have had to add that when he worked on recovering from partitions properly.

          dragonsinth Scott Blum added a comment -

          Thanks. Way beyond the scope of this ticket, I was just curious about it. It seemed a little silly on the face of it, because Solr already has to tolerate long periods of things being marked ACTIVE in the cluster state that aren't really up, since a node can crash and take a long time to recover. I feel like all a recovering node really needs to do is verify that all the cores it owns are in the correct state immediately before publishing its live node entry. We could save a lot of unnecessary overseer ops on node restart.

          jira-bot ASF subversion and git services added a comment -

          Commit 8ac4fdd6bb84b225919d30f6073e7dad02aeb0a1 in lucene-solr's branch refs/heads/master from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8ac4fdd ]

          SOLR-8696: Start the Overseer before actions that need the overseer on init and when reconnecting after zk expiration and improve init logic.

          markrmiller@gmail.com Mark Miller added a comment -

          Thanks Scott!

          jira-bot ASF subversion and git services added a comment -

          Commit 8ac4fdd6bb84b225919d30f6073e7dad02aeb0a1 in lucene-solr's branch refs/heads/jira/SOLR-445 from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8ac4fdd ]

          SOLR-8696: Start the Overseer before actions that need the overseer on init and when reconnecting after zk expiration and improve init logic.

          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks guys and sorry for not paying attention here.

          dragonsinth Scott Blum added a comment - edited

          Mark Miller Shalin Shekhar Mangar

          I think there's a slight problem with the fix as landed (merge artifact?)

                zkStateReader.createClusterStateWatchersAndUpdate();
                Stat stat = zkClient.exists(ZkStateReader.LIVE_NODES_ZKNODE, null, true);
                if (stat != null && stat.getNumChildren() > 0) {
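                  // duplicate: createClusterStateWatchersAndUpdate() was already called above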
                  zkStateReader.createClusterStateWatchersAndUpdate();
                  publishAndWaitForDownStates();
                }
          

          createClusterStateWatchersAndUpdate() shouldn't be called twice, as it sets up duplicate watchers. I actually think we should just have a single call at the top, right after createClusterZkNodes() and right before joining the overseer election, so that if we get elected we have a valid ClusterState to start with.

          (For the record, I'm not super happy with the fact that external code needs to worry so much about initializing ZkStateReader exactly once)

          Attached a patch with a suggested fix.
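
          (A rough sketch of the ordering described above; the placeholder methods stand in for the real ZkController/ZkStateReader calls of the same names, so this is illustrative rather than the actual patch.)

            // Sketch only: a single createClusterStateWatchersAndUpdate() call, made before
            // joining the Overseer election; no second call on the publish path.
            class ZkInitOrderSketch {
              void init() throws Exception {
                createClusterZkNodes();                // base ZK structure
                createClusterStateWatchersAndUpdate(); // exactly once: watchers registered a single time
                joinOverseerElection();                // if elected, we already have a valid ClusterState
                if (liveNodesExist()) {
                  publishAndWaitForDownStates();       // no duplicate watcher setup here
                }
              }

              // Placeholders, not the real implementations.
              void createClusterZkNodes() throws Exception {}
              void createClusterStateWatchersAndUpdate() {}
              void joinOverseerElection() {}
              boolean liveNodesExist() { return true; }
              void publishAndWaitForDownStates() {}
            }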

          markrmiller@gmail.com Mark Miller added a comment -

          Strange, I'll have to look at the code. I thought I had fixed that:

          Need to remove the duplicate call to zkStateReader.createClusterStateWatchersAndUpdate();

          I have a memory of it, but perhaps it didn't get staged right or something.

          jira-bot ASF subversion and git services added a comment -

          Commit a9aec24236df61a3f1cfe533b64169fae84fc6f7 in lucene-solr's branch refs/heads/master from Mark Miller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a9aec24 ]

          SOLR-8696: Straighten out calls to ZkStateReader#createClusterStateWatchersAndUpdate.


            People

            • Assignee: markrmiller@gmail.com Mark Miller
            • Reporter: dragonsinth Scott Blum
            • Votes: 0
            • Watchers: 5
