
SOLR-7117: AutoAddReplicas should have a cluster wide property for controlling number of cores hosted on each node

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.1, master (7.0)
    • Component/s: None
    • Labels: None

      Description

      Currently, when finding the best node to host the failed replicas, we respect the maxShardsPerNode property. This is not ideal: it's a per-collection property, and we need a cluster-wide one. Using maxShardsPerNode can also lead to an unequal distribution of replicas across nodes.

      We should just let users use the CLUSTERPROP API to set the maximum number of cores to be hosted on each node, and use that value when picking the node that will host the replica.
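
      For example, setting such a property could look like this with SolrJ (a sketch, assuming a SolrJ version that provides CollectionAdminRequest.setClusterProperty; the ZooKeeper address and the value 8 are placeholders):

          import org.apache.solr.client.solrj.SolrClient;
          import org.apache.solr.client.solrj.impl.CloudSolrClient;
          import org.apache.solr.client.solrj.request.CollectionAdminRequest;

          public class SetMaxCoresPerNode {
            public static void main(String[] args) throws Exception {
              // Placeholder ZooKeeper address; adjust for your cluster.
              try (SolrClient client = new CloudSolrClient("localhost:2181")) {
                // Equivalent HTTP call:
                //   /admin/collections?action=CLUSTERPROP&name=maxCoresPerNode&val=8
                CollectionAdminRequest.setClusterProperty("maxCoresPerNode", "8")
                    .process(client);
              }
            }
          }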

      Attachments

      1. SOLR-7117.patch
        15 kB
        Varun Thacker
      2. SOLR-7117.patch
        14 kB
        Varun Thacker
      3. SOLR-7117.patch
        13 kB
        Varun Thacker
      4. SOLR-7117.patch
        4 kB
        Varun Thacker

        Activity

        Varun Thacker added a comment -

        Patch which adds a property called "maxCoresPerNode". This value is respected when finding nodes to host failed-over replicas.

        I am not happy with the "maxCoresPerNode" name, but it seemed like the most accurate one. "maxReplicasPerNode" might suggest this value does something similar to "maxShardsPerNode"?

        I will add a test but wanted some feedback on the approach and the name for the property.
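
        Roughly, the selection works like this (a simplified sketch, not the actual patch; pickBestNode and its parameters are made-up names for illustration):

            import java.util.Map;
            import java.util.Set;

            class NodePicker {
              // Hypothetical sketch: skip any live node already at the
              // cluster-wide "maxCoresPerNode" limit, and prefer the
              // least-loaded of the remaining nodes.
              static String pickBestNode(Set<String> liveNodes,
                                         Map<String, Integer> coresPerNode,
                                         int maxCoresPerNode) {
                String best = null;
                int fewest = Integer.MAX_VALUE;
                for (String node : liveNodes) {
                  int cores = coresPerNode.getOrDefault(node, 0);
                  if (cores >= maxCoresPerNode) continue; // at the limit
                  if (cores < fewest) {
                    fewest = cores;
                    best = node;
                  }
                }
                return best; // null if every live node is at the limit
              }
            }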

        Mark Miller added a comment -

        Cool - we should add some tests though. There is some class that makes it easy to quickly test what nodes are chosen for failover.

        Varun Thacker added a comment -

        Thanks Mark for reviewing the patch.

        There is some class that makes it easy to quickly test what nodes are chosen for failover.

        Yeah, I saw SharedFSAutoReplicaFailoverUtilsTest. It's super cool!

        New patch which adds a test for this property.

        A couple of other changes I made:

        • Modified a live node in testGetBestCreateUrlMultipleCollections(). I believe it was intended to be node2 and node1 in the first place.
        • In the snippet below, should the check be only for node3, since node2 is already hosting the replica? Unless "r2-4" meant node2 or node4:
            result = buildClusterState("csr1R*r2-4sr3r4r5", NODE1, NODE2, NODE3);
            createUrl = OverseerAutoReplicaFailoverThread.getBestCreateUrl(result.reader, result.badReplica, null);
            assertTrue(createUrl.equals(NODE2_URL) || createUrl.equals(NODE3_URL));
        

        Should we keep the property name "maxCoresPerNode" or something else?

        Mark Miller added a comment -

        In this should the check be only for node3 since node2 is already hosting the replica. Unless "r2-4" meant node2 or node4

        I think that '-' notation has been removed (it's probably just ignored now). I think you used to specify the replica state that way; '-4' probably would have been the 'recovery failed' state.

        But yeah, it looks like it should just be NODE3, with that notation not working anymore.

        I'm guessing it's meant to be (now): "csr1R*r2Fsr3r4r5"
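
        For reference, here's my best guess at how that string decodes (the authoritative grammar is in SharedFSAutoReplicaFailoverUtilsTest, so treat this as a rough reading):

            // "csr1R*r2Fsr3r4r5" -- one possible reading:
            // c      -> start a collection
            // s      -> start a shard
            // r1R*   -> replica on node1, state 'R' (recovering?), '*' marks it as the bad replica
            // r2F    -> replica on node2, state 'F' (recovery failed)
            // s      -> start a second shard
            // r3r4r5 -> replicas on node3, node4, and node5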

        Should we keep the property name "maxCoresPerNode"

        Sounds reasonable to me.

        Mark Miller added a comment -

        Hmm...don't we want to respect both collection max per node and the universal max per node?

        Since we respect the collection max on collection create and store it, I would first argue we should probably keep it in consideration.

        If we were to stop respecting it and respect only the universal limit, we would have to consider the effect on back-compat for a 5.1 upgrade.

        Varun Thacker added a comment -

        Hmm...don't we want to respect both collection max per node and the universal max per node?

        Makes sense, I guess. I will update the patch to check for both maxShardsPerNode and maxCoresPerNode.

        Varun Thacker added a comment -

        Updated patch.

        Now we respect both "maxShardsPerNode" and "maxCoresPerNode": if a node hits either limit, it is not picked as a recovery node. A rough sketch of the check is below.
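
        Something like this, conceptually (a simplified sketch with made-up names; the real logic lives in the patch to OverseerAutoReplicaFailoverThread):

            class RecoveryNodeCheck {
              // Hypothetical: a node qualifies as a recovery target only if it is
              // under BOTH the per-collection limit (maxShardsPerNode) and the
              // cluster-wide limit (maxCoresPerNode). A null limit means "unset".
              static boolean canHostReplica(int coresOnNode,
                                            int collectionReplicasOnNode,
                                            Integer maxCoresPerNode,
                                            Integer maxShardsPerNode) {
                if (maxCoresPerNode != null && coresOnNode >= maxCoresPerNode) {
                  return false; // at the cluster-wide core limit
                }
                if (maxShardsPerNode != null
                    && collectionReplicasOnNode >= maxShardsPerNode) {
                  return false; // at the collection's per-node limit
                }
                return true;
              }
            }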

        Mark Miller added a comment -

        Great, thanks Varun.

        Mark Miller added a comment -

        Hey Varun Thacker, think we can commit this?

        Varun Thacker added a comment -

        Oh yes, we should. Completely lost track of it.

        I'll bring the patch up to trunk and run the tests.

        Varun Thacker added a comment -

        Updated patch against master. I'll commit this shortly.

        ASF subversion and git services added a comment -

        Commit a969d99ce92be36ef8e155218701696d1742f47d in lucene-solr's branch refs/heads/master from Varun Thacker
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a969d99 ]

        SOLR-7117: AutoAddReplicas should have a cluster wide property for controlling number of cores hosted on each node

        ASF subversion and git services added a comment -

        Commit 8e999d1a966622980bba43bfcd14bb2218b25a2f in lucene-solr's branch refs/heads/branch_6x from Varun Thacker
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8e999d1 ]

        SOLR-7117: AutoAddReplicas should have a cluster wide property for controlling number of cores hosted on each node

        Varun Thacker added a comment -

        Thanks Mark!

        Hoss Man added a comment -

        Manually correcting fixVersion per Step #S5 of LUCENE-7271


          People

          • Assignee: Varun Thacker
          • Reporter: Varun Thacker
          • Votes: 0
          • Watchers: 3
