Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-10277

On 'downnode', lots of wasteful mutations are done to ZK

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5.3, 5.5.4, 6.0.1, 6.2.1, 6.3, 6.4.2
    • Fix Version/s: 6.5.1, 7.0
    • Component/s: SolrCloud
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:

      Description

      When a node restarts, it submits a single 'downnode' message to the overseer's state update queue.

      When the overseer processes the message, it does way more writes to ZK than necessary. In our cluster of 48 hosts, the majority of collections have only 1 shard and 1 replica. So a single node restarting should only result in ~1/40th of the collections being updated with new replica states (to indicate the node that is no longer active).

      However, the current logic in NodeMutator#downNode always updates every collection. So we end up having to do rolling restarts very slowly to avoid having a severe outage due to the overseer having to do way too much work for each host that is restarted. And subsequent shards becoming leader can't get processed until the `downnode` message is fully processed. So a fast rolling restart can result in the overseer queue growing incredibly large and nearly all shards winding up in a leader-less state until that backlog is processed.

      The fix is a trivial logic change to only add a ZkWriteCommand for collections that actually have an impacted replica.

      1. SOLR-10277.patch
        36 kB
        Shalin Shekhar Mangar
      2. SOLR-10277.patch
        35 kB
        Varun Thacker
      3. SOLR-10277-5.5.3.patch
        10 kB
        Joshua Humphries

        Issue Links

          Activity

          Hide
          varunthacker Varun Thacker added a comment - - edited

          Hi Joshua,

          However, the current logic in NodeMutator#downNode always updates every collection.

          I am checking against Solr 5.5.3 since you have listed that as the 'Affected Versions' . Looking at https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.3/solr/core/src/java/org/apache/solr/cloud/overseer/NodeMutator.java#L65 it looks to me that it only updates those replicas of collection which belong to this node. Am i missing something here?

          Show
          varunthacker Varun Thacker added a comment - - edited Hi Joshua, However, the current logic in NodeMutator#downNode always updates every collection. I am checking against Solr 5.5.3 since you have listed that as the 'Affected Versions' . Looking at https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.3/solr/core/src/java/org/apache/solr/cloud/overseer/NodeMutator.java#L65 it looks to me that it only updates those replicas of collection which belong to this node. Am i missing something here?
          Hide
          jhump Joshua Humphries added a comment -

          It's only changing properties for replicas that correspond to this node. However, if you look at the whole method, especially 10 lines below, you'll find that it's always adding a ZkWriteCommand for each collection, regardless of whether any replica properties were touched. So it generates the necessary changes and a bunch of no-op updates for every other collection.

          Show
          jhump Joshua Humphries added a comment - It's only changing properties for replicas that correspond to this node. However, if you look at the whole method, especially 10 lines below, you'll find that it's always adding a ZkWriteCommand for each collection, regardless of whether any replica properties were touched. So it generates the necessary changes and a bunch of no-op updates for every other collection.
          Hide
          varunthacker Varun Thacker added a comment -

          Hi Joshua,

          and a bunch of no-op updates for every other collection.

          Okay I see what you're saying now. Will you want to submit a patch? Else I will take a look at it.

          Show
          varunthacker Varun Thacker added a comment - Hi Joshua, and a bunch of no-op updates for every other collection. Okay I see what you're saying now. Will you want to submit a patch? Else I will take a look at it.
          Hide
          jhump Joshua Humphries added a comment -

          I have a patch and a test for 5.5.3. Looks like this particular file (NodeMutator.java) hasn't changed and is still basically the same at HEAD, so it should be easy to apply to 6.4/6.5, too.

          I'd like to actually get my patch deployed in our cluster so I can measure its impact to rolling-restarts of the cluster.

          Show
          jhump Joshua Humphries added a comment - I have a patch and a test for 5.5.3. Looks like this particular file (NodeMutator.java) hasn't changed and is still basically the same at HEAD, so it should be easy to apply to 6.4/6.5, too. I'd like to actually get my patch deployed in our cluster so I can measure its impact to rolling-restarts of the cluster.
          Hide
          varunthacker Varun Thacker added a comment -

          Yes the patch should apply to master cleanly. You can always post the patch before testing and report back with the numbers once you have deployed it. Either way it's fine.

          I think it's an important fix for large clusters so I would like to get this in Solr 6.5

          Show
          varunthacker Varun Thacker added a comment - Yes the patch should apply to master cleanly. You can always post the patch before testing and report back with the numbers once you have deployed it. Either way it's fine. I think it's an important fix for large clusters so I would like to get this in Solr 6.5
          Hide
          jhump Joshua Humphries added a comment - - edited

          I've attached a patch file that should cleanly apply to version 5.5.3, and hopefully to other versions to.

          Show
          jhump Joshua Humphries added a comment - - edited I've attached a patch file that should cleanly apply to version 5.5.3, and hopefully to other versions to.
          Hide
          varunthacker Varun Thacker added a comment -

          Hi Joshua,

          The patch didn't apply cleanly but was pretty simple to fix the couple of conflicts that I had with master.

          It's great that we have a test for this. We already have https://github.com/apache/lucene-solr/blob/master/solr/core/src/test/org/apache/solr/cloud/SharedFSAutoReplicaFailoverUtilsTest.java#L83 which is similar to NodeMutatorTest#makeClusterState , but thats only because I had seen that test in the past for some other work. Should we put that method into a class like GenerateClusterStateMocks or something so thats it's more discoverable in the future and reuse that here ?

          Lastly I just want to understand the motivation behind changing ClusterState#getCollections to use a LinkedHashSet instead of a HashSet. Do you think the ordering of the entries helps us in any way here?

          Patch looks great otherwise!

          Show
          varunthacker Varun Thacker added a comment - Hi Joshua, The patch didn't apply cleanly but was pretty simple to fix the couple of conflicts that I had with master. It's great that we have a test for this. We already have https://github.com/apache/lucene-solr/blob/master/solr/core/src/test/org/apache/solr/cloud/SharedFSAutoReplicaFailoverUtilsTest.java#L83 which is similar to NodeMutatorTest#makeClusterState , but thats only because I had seen that test in the past for some other work. Should we put that method into a class like GenerateClusterStateMocks or something so thats it's more discoverable in the future and reuse that here ? Lastly I just want to understand the motivation behind changing ClusterState#getCollections to use a LinkedHashSet instead of a HashSet. Do you think the ordering of the entries helps us in any way here? Patch looks great otherwise!
          Hide
          jhump Joshua Humphries added a comment -

          Thanks for pointing out SharedFSAutoReplicaFailoverUtilsTest. I'll take a look to see how things could be consolidated a little better.

          The change in ClusterState to use a LinkedHashSet is for deterministic ordering. If you remove that part of the patch, you'll find that the test no longer passes as iteration order over the collection names becomes non-deterministic. I saw that LinkedHash* is used in a few other places and figured it would not be objectionable.

          Show
          jhump Joshua Humphries added a comment - Thanks for pointing out SharedFSAutoReplicaFailoverUtilsTest. I'll take a look to see how things could be consolidated a little better. The change in ClusterState to use a LinkedHashSet is for deterministic ordering. If you remove that part of the patch, you'll find that the test no longer passes as iteration order over the collection names becomes non-deterministic. I saw that LinkedHash* is used in a few other places and figured it would not be objectionable.
          Hide
          jhump Joshua Humphries added a comment -

          Oops. I was wrong about needing the LinkedHashSet in ClusterState.

          Looks like I solved the deterministic ordering issue twice and left both of the fixes in the patch. The other fix is where the test sorts the list of writes, and that alone suffices. So we can undo those changes in ClusterState.java.

          Show
          jhump Joshua Humphries added a comment - Oops. I was wrong about needing the LinkedHashSet in ClusterState. Looks like I solved the deterministic ordering issue twice and left both of the fixes in the patch. The other fix is where the test sorts the list of writes, and that alone suffices. So we can undo those changes in ClusterState.java.
          Hide
          jhump Joshua Humphries added a comment -

          New (slightly smaller) patch file uploaded. Undoes change to ClusterState.java which turns out to be superfluous.

          Show
          jhump Joshua Humphries added a comment - New (slightly smaller) patch file uploaded. Undoes change to ClusterState.java which turns out to be superfluous.
          Hide
          dragonsinth Scott Blum added a comment -

          Patch LGTM, any objections to merging this?

          Show
          dragonsinth Scott Blum added a comment - Patch LGTM, any objections to merging this?
          Hide
          varunthacker Varun Thacker added a comment -

          Hi Scott,

          I was updating the patch and trying to reuse SharedFSAutoReplicaFailoverUtilsTest for the test case when it create a cluster state.

          Give me a coupe of hour. I just started working on it today evening and should have something up soon.

          You can review and commit it then.

          Show
          varunthacker Varun Thacker added a comment - Hi Scott, I was updating the patch and trying to reuse SharedFSAutoReplicaFailoverUtilsTest for the test case when it create a cluster state. Give me a coupe of hour. I just started working on it today evening and should have something up soon. You can review and commit it then.
          Hide
          dragonsinth Scott Blum added a comment -

          Sure thing, no rush! Would love to get more eyes + miles on it.

          Show
          dragonsinth Scott Blum added a comment - Sure thing, no rush! Would love to get more eyes + miles on it.
          Hide
          varunthacker Varun Thacker added a comment -

          New patch which basically takes Joshua's patch and reuses existing test code to create a cluster state.

          The test fails if we don't change NodeMutator

          We probably need to make a couple of minor changes before committing ( addressing the nocommit etc )

          Show
          varunthacker Varun Thacker added a comment - New patch which basically takes Joshua's patch and reuses existing test code to create a cluster state. The test fails if we don't change NodeMutator We probably need to make a couple of minor changes before committing ( addressing the nocommit etc )
          Hide
          varunthacker Varun Thacker added a comment -

          Hi Scott,

          Any comments on the last patch apart from addressing the nocommit ?

          Joshua curious if you deployed this on your cluster and how much improvements did you see?

          Regardless this should be a good candidate for Solr 6.5.1 as the fix is pretty safe. Joel Bernstein let me know if you're okay to commit this to the 6_5 branch as well

          Show
          varunthacker Varun Thacker added a comment - Hi Scott, Any comments on the last patch apart from addressing the nocommit ? Joshua curious if you deployed this on your cluster and how much improvements did you see? Regardless this should be a good candidate for Solr 6.5.1 as the fix is pretty safe. Joel Bernstein let me know if you're okay to commit this to the 6_5 branch as well
          Hide
          jhump Joshua Humphries added a comment -

          Hey, Varun,

          I actually just got it into our production cluster yesterday. It reduced
          the average time to restart a node (just under 50 nodes) and have it fully
          active (e.g. overseer queue go quiet, all collections updated) from about
          2.5 minutes down to 55 seconds. Our tools to examine and verify cluster
          states also show that everything looked good.


          Josh Humphries

          FullStory <https://www.fullstory.com/> | Atlanta, GA

          Software Engineer

          jh@fullstory.com

          On Fri, Mar 31, 2017 at 2:04 PM, Varun Thacker (JIRA) <jira@apache.org>

          Show
          jhump Joshua Humphries added a comment - Hey, Varun, I actually just got it into our production cluster yesterday. It reduced the average time to restart a node (just under 50 nodes) and have it fully active (e.g. overseer queue go quiet, all collections updated) from about 2.5 minutes down to 55 seconds. Our tools to examine and verify cluster states also show that everything looked good. Josh Humphries FullStory < https://www.fullstory.com/ > | Atlanta, GA Software Engineer jh@fullstory.com On Fri, Mar 31, 2017 at 2:04 PM, Varun Thacker (JIRA) <jira@apache.org>
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          It'd be nice to release this fix in 6.5.1 – looks serious.

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - It'd be nice to release this fix in 6.5.1 – looks serious.
          Hide
          dragonsinth Scott Blum added a comment -

          Agreed, Shalin Shekhar Mangar I'm actually OOO all week-- if you wanted to take point on getting this landed that would be super. I reviewed all the live code previously, but not Varun Thacker's patch to the test. (though to be honest I'm not super familiar with the test frameworks anyway)

          Show
          dragonsinth Scott Blum added a comment - Agreed, Shalin Shekhar Mangar I'm actually OOO all week-- if you wanted to take point on getting this landed that would be super. I reviewed all the live code previously, but not Varun Thacker 's patch to the test. (though to be honest I'm not super familiar with the test frameworks anyway)
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          No problem, I'll review the patch and commit.

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - No problem, I'll review the patch and commit.
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          Patch removes the nocommit and fixes the tests. I didn't remove the limitation of always having a bad replica in ClusterStateMockUtil but worked around it in the new tests. We can fix this in a different issue. I also changed the NodeMutator#downNode method slightly to remove a few extra copies and a ZK call (via clusterState.getCollection).

          I'll commit after running precommit and all tests.

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - Patch removes the nocommit and fixes the tests. I didn't remove the limitation of always having a bad replica in ClusterStateMockUtil but worked around it in the new tests. We can fix this in a different issue. I also changed the NodeMutator#downNode method slightly to remove a few extra copies and a ZK call (via clusterState.getCollection). I'll commit after running precommit and all tests.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 60303028debf3927e0c3abfaaa4015f73b88e689 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6030302 ]

          SOLR-10277: On 'downnode', lots of wasteful mutations are done to ZK

          Show
          jira-bot ASF subversion and git services added a comment - Commit 60303028debf3927e0c3abfaaa4015f73b88e689 in lucene-solr's branch refs/heads/master from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6030302 ] SOLR-10277 : On 'downnode', lots of wasteful mutations are done to ZK
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit ef5db56632bdc2ec73a51bf393c6670318e80bd8 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ef5db56 ]

          SOLR-10277: On 'downnode', lots of wasteful mutations are done to ZK

          (cherry picked from commit 6030302)

          Show
          jira-bot ASF subversion and git services added a comment - Commit ef5db56632bdc2ec73a51bf393c6670318e80bd8 in lucene-solr's branch refs/heads/branch_6x from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ef5db56 ] SOLR-10277 : On 'downnode', lots of wasteful mutations are done to ZK (cherry picked from commit 6030302)
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 8a2a409eca3ea3de6d31188deb8b532522f8e9e6 in lucene-solr's branch refs/heads/branch_6_5 from Shalin Shekhar Mangar
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8a2a409 ]

          SOLR-10277: On 'downnode', lots of wasteful mutations are done to ZK

          (cherry picked from commit 6030302)

          (cherry picked from commit ef5db56)

          Show
          jira-bot ASF subversion and git services added a comment - Commit 8a2a409eca3ea3de6d31188deb8b532522f8e9e6 in lucene-solr's branch refs/heads/branch_6_5 from Shalin Shekhar Mangar [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8a2a409 ] SOLR-10277 : On 'downnode', lots of wasteful mutations are done to ZK (cherry picked from commit 6030302) (cherry picked from commit ef5db56)
          Hide
          shalinmangar Shalin Shekhar Mangar added a comment -

          Thanks everyone!

          Show
          shalinmangar Shalin Shekhar Mangar added a comment - Thanks everyone!
          Hide
          dragonsinth Scott Blum added a comment -

          Thank you Shalin!

          Show
          dragonsinth Scott Blum added a comment - Thank you Shalin!

            People

            • Assignee:
              dragonsinth Scott Blum
              Reporter:
              jhump Joshua Humphries
            • Votes:
              3 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development