Solr
SOLR-2880 (sub-task of SOLR-2358: Distributing Indexing)

Investigate adding an overseer that can assign shards, later do re-balancing, etc

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: SolrCloud
    • Labels: None

    Attachments

    1. SOLR-2880.patch
      96 kB
      Sami Siren
    2. SOLR-2880-merge-elections.patch
      23 kB
      Sami Siren
    3. SOLR-2880.patch
      21 kB
      Sami Siren
    4. SOLR-2880.patch
      45 kB
      Sami Siren
    5. SOLR-2880.patch
      29 kB
      Sami Siren
    6. SOLR-2880.patch
      30 kB
      Sami Siren
    7. SOLR-2880.patch
      60 kB
      Sami Siren
    8. SOLR-2880.patch
      22 kB
      Sami Siren


        Activity

        Sami Siren added a comment -

        This patch implements a simple overseer (cluster leader). Currently it does not do more than assign shard ids to cores so there is no added functionality. I wanted to post what I currently have here to get some feedback and improvement ideas.

        Two new top level zkNodes are created:

        /node_assignments
        /node_states

        Basically the integration point is the ZkController:

        When a core is registered, its state is recorded locally and the node's current state is published under /node_states.

        When the overseer assigns a shard id, the state is stored locally (by the overseer) and the assignments for the node are published under /node_assignments.

        The overseer gets the number of shards required for a collection from the collection's properties in ZK. So if you wanted to add a new collection, you'd create a new collection node and record the required number of shards (among other things, like the configuration used) in its properties.

        Even if a node hosts multiple cores, it only creates a single node (/node_states/<id>) and reads a single node (/node_assignments/<id>). It might be simpler to have multiple subnodes (one for each core).
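
        A minimal sketch of the publishing step described above, using the raw ZooKeeper client (Solr wraps this in SolrZkClient); the class name and the shape of the JSON payload are illustrative, not the patch's actual API:

        import java.nio.charset.StandardCharsets;
        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        // Publishes a node's per-core state under /node_states/<nodeName>.
        public class NodeStatePublisher {
          private final ZooKeeper zk;

          public NodeStatePublisher(ZooKeeper zk) {
            this.zk = zk;
          }

          // Create or update /node_states/<nodeName> with a JSON list of core states.
          public void publish(String nodeName, String coreStatesJson)
              throws KeeperException, InterruptedException {
            String path = "/node_states/" + nodeName;
            byte[] data = coreStatesJson.getBytes(StandardCharsets.UTF_8);
            if (zk.exists(path, false) == null) {
              zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else {
              zk.setData(path, data, -1); // -1 = any version
            }
          }
        }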

        Mark Miller added a comment -

        Nice, thanks Sami! I'd like to get this committed to the branch quickly so that we don't have to juggle patch updates much.

        Do all tests pass yet for you? I see the full distrib test failing - before I look into it, I just wanted to make sure it's expected - if so, I'm happy to dig in.

        Also, quick comment on the node assignments zookeeper node - do we really need it? Can't we figure out the assignments from the clusterstate info? Does it add anything?

        More comments to come.

        Sami Siren added a comment -

        Do all tests pass yet for you?

        I think there might be at least something timing related: I do not see failures every time. I also see some failures that seem unrelated, like:
        in SoftAutoCommitTest, CoreAdminHandlerTest#testCoreAdminHandler, TestCoreContainer#testPersist

        Also, quick comment on the node assignments zookeeper node - do we really need it? Can't we figure out the assignments from the clusterstate info? Does it add anything?

        The motivation for adding it was that it might reduce the amount of data that needs to be transferred, but currently it does not even do that because ZkController still uses the state, so perhaps the assignments node should be dropped.

        Sami Siren added a comment -

        I also see some failures that seem unrelated, like: in SoftAutoCommitTest, CoreAdminHandlerTest#testCoreAdminHandler, TestCoreContainer#testPersist

        Those went away with svn up. However I now see TestRecovery fail constantly when running solr tests:

        <testcase classname="org.apache.solr.search.TestRecovery" name="testLogReplay" time="60.774">
            <failure type="junit.framework.AssertionFailedError">junit.framework.AssertionFailedError: 
                at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:149)
                at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:51)
                at org.apache.solr.search.TestRecovery.testLogReplay(TestRecovery.java:82)
                at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:521)
        </failure>
          </testcase>
        

        If I only run that single test it passes.

        Yonik Seeley added a comment -

        However I now see TestRecovery fail constantly when running solr tests:

        Probably my fault - I removed the previous version update processor and switched the tests to use the new distributed update processor. I'll try a fresh checkout (too many unrelated changes in my current copy).

        Mark Miller added a comment -

        Thanks Sami - all tests are passing for me at the moment.

        Is there much that is special about the overseer election code vs the leader election code? Seems there is a lot of code dupe there, both in the classes and the tests. Can we merge that up?

        Sami Siren added a comment -

        The attached patch merges the election code. The OverseerElectionTest should be removed because it now has nothing special to test. SliceLeaderElector should perhaps be renamed to LeaderElector or something like that.
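
        For reference, the merged election logic follows the standard ZooKeeper recipe visible in the /overseer_elect layout later in this issue: each participant creates an ephemeral sequential node under an election path, and whoever holds the lowest sequence number is the leader. A minimal sketch with illustrative class and path names (not the patch's actual API); a full implementation would also watch the next-lowest node to detect leader changes:

        import java.util.Collections;
        import java.util.List;
        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        public class SimpleLeaderElector {
          private final ZooKeeper zk;
          private final String electionPath; // e.g. /overseer_elect/election

          public SimpleLeaderElector(ZooKeeper zk, String electionPath) {
            this.zk = zk;
            this.electionPath = electionPath;
          }

          // Join the election; returns true if this participant currently holds the lowest seat.
          public boolean joinAndCheck() throws KeeperException, InterruptedException {
            String me = zk.create(electionPath + "/n_", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
            List<String> children = zk.getChildren(electionPath, false);
            Collections.sort(children); // zero-padded sequence suffixes sort lexicographically
            String myNode = me.substring(me.lastIndexOf('/') + 1);
            return children.get(0).equals(myNode);
          }
        }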

        Mark Miller added a comment -

        Committed - thanks Sami!

        Mark Miller added a comment -

        Would it be more convenient to store what's on the nodes in node_states and node_assignments as human-readable? Like the ZkNodeProps?

        I actually just noticed we use the same ZkNodeProps for the current leader zk node as we use for the node in the cluster state - not a problem for info that doesn't change, but now that we store the "active,replicating" state there, as well as other properties that could be dynamic, it ends up with stale data as those properties change.

        Sami Siren added a comment -

        Would it be more convenient to store what's on the nodes in node_states and node_assignments as human-readable?

        I agree, I'll change that.

        I actually just noticed we use the same ZkNodeProps for the current leader zk node as we use for the node in the cluster state - not a problem for info that doesn't change, but now that we store the "active,replicating" state there, as well as other properties that could be dynamic, it ends up with stale data as those properties change.

        Who uses the information from the leader node (what is actually needed there?)

        Mark Miller added a comment -

        Who uses the information from the leader node (what is actually needed there?)

        Just the URL right now, I think.

        Mark Miller added a comment -

        Why does the overseer class have its own cloud state and watches on live nodes and stuff? Isn't this all just redundant? The ZkController's ZkStateReader is already tracking all this stuff and should be the owner of the cloud state, shouldn't it?

        Sami Siren added a comment -

        Why does the overseer class have it's own cloud state and watches on live nodes and stuff?

        The watch for live nodes is also used for adding watches for node states: when a new node pops up, a watch is generated for /node_states/<node-name>.

        The ZkController's ZkStateReader is already tracking all this stuff and should be the owner of the cloud state, shouldn't it?

        Yeah, makes sense. I'll see how that would work.
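
        A minimal sketch of the watch chaining described above, assuming the raw ZooKeeper API (Solr's SolrZkClient/ZkStateReader wrap this); the class name and callback body are illustrative:

        import java.util.List;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.WatchedEvent;
        import org.apache.zookeeper.Watcher;
        import org.apache.zookeeper.ZooKeeper;

        public class LiveNodesWatcher implements Watcher {
          private final ZooKeeper zk;

          public LiveNodesWatcher(ZooKeeper zk) {
            this.zk = zk;
          }

          public void start() throws KeeperException, InterruptedException {
            watchLiveNodes();
          }

          private void watchLiveNodes() throws KeeperException, InterruptedException {
            // Re-registers the children watch and picks up any nodes that appeared meanwhile.
            List<String> liveNodes = zk.getChildren("/live_nodes", this);
            for (String node : liveNodes) {
              // Data watch on the node's published state; fires when the node updates it.
              zk.exists("/node_states/" + node, this);
            }
          }

          public void process(WatchedEvent event) {
            try {
              if (event.getType() == Event.EventType.NodeChildrenChanged) {
                watchLiveNodes(); // a node came up or went away
              } else if (event.getType() == Event.EventType.NodeDataChanged) {
                // a node published a new state; the overseer would re-read it here
                zk.exists(event.getPath(), this); // re-register the data watch
              }
            } catch (KeeperException e) {
              // real code would log and retry
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          }
        }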

        Yonik Seeley added a comment - edited

        For those following along, here's what the ZK layout looks like after starting 4 nodes (with the default 3 shards).

        ...../NODE_ASSIGNMENTS (v=0 children=4)
                  ROGUE:7574_SOLR (v=4) "[{ "_collection":"collection1", "_core":"Rogue:7574_solr_", "shard_name":"shard2"}]"
                  ROGUE:2222_SOLR (v=2) "[{ "_collection":"collection1", "_core":"Rogue:2222_solr_", "shard_name":"shard1"}]"
                  ROGUE:1111_SOLR (v=2) "[{ "_collection":"collection1", "_core":"Rogue:1111_solr_", "shard_name":"shard3"}]"
                  ROGUE:8983_SOLR (v=2) "[{ "_collection":"collection1", "_core":"Rogue:8983_solr_", "shard_name":"shard1"}]"
             /NODE_STATES (v=0 children=4)
                  ROGUE:7574_SOLR (v=4) "[{ "shard_id":"shard2", "_collection":"collection1", "roles":"", "_core":"Rogue:7574_solr_", "s..."
                  ROGUE:2222_SOLR (v=2) "[{ "shard_id":"shard1", "_collection":"collection1", "roles":"", "_core":"Rogue:2222_solr_", "s..."
                  ROGUE:1111_SOLR (v=2) "[{ "shard_id":"shard3", "_collection":"collection1", "roles":"", "_core":"Rogue:1111_solr_", "s..."
                  ROGUE:8983_SOLR (v=2) "[{ "shard_id":"shard1", "_collection":"collection1", "roles":"", "_core":"Rogue:8983_solr_", "s..."
             /ZOOKEEPER (v=0 children=1) ""
                  QUOTA (v=0) ""
             /CLUSTERSTATE.JSON (v=10) "{"collection1":{ "shard1":{ "Rogue:2222_solr_":{ "shard_id":"shard1", "_collection":"collection1..."
             /LIVE_NODES (v=4 children=3)
                  ROGUE:2222_SOLR (ephemeral v=0)
                  ROGUE:1111_SOLR (ephemeral v=0)
                  ROGUE:8983_SOLR (ephemeral v=0)
             /COLLECTIONS (v=1 children=1)
                  COLLECTION1 (v=0 children=2) "{ "configName":"myconf", "num_shards":"3"}"
                       SHARDS_LOCK (v=0)
                       LEADER_ELECT (v=0 children=3)
                            SHARD1 (v=0 children=2)
                                 ELECTION (v=0 children=2)
                                      N_0000000001 (ephemeral v=0)
                                      N_0000000000 (ephemeral v=0)
                                 LEADER (v=0 children=1)
                                      ROGUE:8983_SOLR_ (ephemeral v=0) "{ "shard_id":"shard1", "roles":"", "state":"recovering", "node_name":"Rogue:8983_solr", "url":"..."
                            SHARD2 (v=0 children=2)
                                 ELECTION (v=0 children=1)
                                      N_0000000000 (ephemeral v=0)
                                 LEADER (v=0 children=1)
                                      ROGUE:7574_SOLR_ (ephemeral v=0) "{ "shard_id":"shard2", "roles":"", "state":"recovering", "node_name":"Rogue:7574_solr", "url":"..."
                            SHARD3 (v=0 children=2)
                                 ELECTION (v=0 children=1)
                                      N_0000000000 (ephemeral v=0)
                                 LEADER (v=0 children=1)
                                      ROGUE:1111_SOLR_ (ephemeral v=0) "{ "shard_id":"shard3", "roles":"", "state":"recovering", "node_name":"Rogue:1111_solr", "url":"..."
             /OVERSEER_ELECT (v=0 children=2)
                  ELECTION (v=0 children=4)
                       N_0000000004 (ephemeral v=0)
                       N_0000000003 (ephemeral v=0)
                       N_0000000001 (ephemeral v=0)
                       N_0000000000 (ephemeral v=0)
                  LEADER (v=0)
        

        Notes: Starting up the first node by bootstrapping fails the first time you try. Run it again and it works.
        java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar

        Yonik Seeley added a comment -

        Here's the data on some of the bigger nodes:

        /node_assignments/Rogue:7574_solr
        [{
            "_collection":"collection1",
            "_core":"Rogue:7574_solr_",
            "shard_name":"shard2"}]
        
        
        /node_states/Rogue:7574_solr
        [{
            "shard_id":"shard2",
            "_collection":"collection1",
            "roles":"",
            "_core":"Rogue:7574_solr_",
            "state":"active",
            "node_name":"Rogue:7574_solr",
            "url":"http://Rogue:7574/solr/"}]
        
        
        /collections/collection1/leader_elect/shard2/leader/Rogue:7574_solr_
        {
          "shard_id":"shard2",
          "roles":"",
          "state":"recovering",
          "node_name":"Rogue:7574_solr",
          "url":"http://Rogue:7574/solr/"}
        
        
        /clusterstate.json
        {"collection1":{
            "shard1":{
              "Rogue:2222_solr_":{
                "shard_id":"shard1",
                "_collection":"collection1",
                "roles":"",
                "_core":"Rogue:2222_solr_",
                "state":"active",
                "node_name":"Rogue:2222_solr",
                "url":"http://Rogue:2222/solr/"},
              "Rogue:8983_solr_":{
                "shard_id":"shard1",
                "_collection":"collection1",
                "roles":"",
                "_core":"Rogue:8983_solr_",
                "state":"active",
                "node_name":"Rogue:8983_solr",
                "url":"http://Rogue:8983/solr/"}},
            "shard2":{"Rogue:7574_solr_":{
                "shard_id":"shard2",
                "_collection":"collection1",
                "roles":"",
                "_core":"Rogue:7574_solr_",
                "state":"active",
                "node_name":"Rogue:7574_solr",
                "url":"http://Rogue:7574/solr/"}},
            "shard3":{"Rogue:1111_solr_":{
                "shard_id":"shard3",
                "_collection":"collection1",
                "roles":"",
                "_core":"Rogue:1111_solr_",
                "state":"active",
                "node_name":"Rogue:1111_solr",
                "url":"http://Rogue:1111/solr/"}}}}
        
        
        
        Yonik Seeley added a comment -

        Some random comments so far... Naming:

        • numShards vs num_shards... we should try to make system properties consistent with the names that actually appear in ZK
        • _core, _collection? why the underscores?

        I'm not sure num_shards belongs as a configuration item anywhere (in solr.xml or as a collection property in ZK). The number of shards a collection has is always just the number you see in ZK under the collection. This will make it easier for people with custom sharding to just add another shard. Whoever is creating the initial layout should thus create all of the shards at once.

        Sami Siren added a comment -

        Here's a patch that fixes a bug that causes some state updates to be missed when the overseer node crashes. It also removes the "_" prefixes from the mentioned property names and uses numShards as the shard count property name.

        Sami Siren added a comment -

        This will make it easier for people with custom sharding to just add another shard

        Yeah, that's a good point. Does the "initial layout creation" code exist anywhere yet?

        Mark Miller added a comment -

        Thanks Sami! Committed.

        Mark Miller added a comment -

        The number of shards a collection has is always just the number you see in ZK under the collection. This will make it easier for people with custom sharding to just add another shard. Whoever is creating the initial layout should thus create all of the shards at once.

        This seems more complicated to me?

        Custom shards are easy at the moment - numShards is only for the auto sharder - other code that needs to know the number of shards (like around hashing) simply counts the number of shards. If you add another shard manually, it should be no problem.

        numShards should just be called autoNumShards or something perhaps.

        Mark Miller added a comment -

        Now that I got my initial thought down...

        After adding your custom shard, you would then have to change the autoShardNum for further replicas to be placed correctly with the current code...

        The other option seems to be perhaps starting up all your nodes, each is a new shard, and then you make some call that says you are done adding shards. I didn't really like that compared to just passing a sys prop up front, but perhaps it is better in the end...

        Mark Miller added a comment -

        Notes: Starting up the first node by bootstrapping fails the first time you try. Run it again and it works.
        java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar

        What's the failure? I have not seen this.

        Yonik Seeley added a comment -

        I didn't really like that compared to just passing a sys prop up front

        That's what I'm proposing - that numShards not be stored anywhere in ZK at all. Creation of a new collection requires someone creating some initial nodes (i.e. the collection node at a minimum). Currently I believe this is just the bootstrapping code.

        Scenario 1: autosharding
        Whoever creates the new collection also creates a placeholder for each shard. The current bootstrapping code can look at a numShards system property, or a create collection API could have a numShards parameter.

        Scenario 2: custom sharding
        Create the new collection as normal, but just don't create any placeholders for each shard. Adding a new shard is just

        We also shouldn't rely on the number of shards to split up the hash range (except initially when creating a collection) - each shard should advertise what its range is (in the case of autosharding or sharding by hash). This is important for future splitting of shards (i.e. you could concurrently have shards that cover 0-9, 0-4, and 5-9).

        So it seems like numShards is really just an input into new collection creation, not an intrinsic property of the collection.
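
        A minimal sketch of the range-splitting idea, assuming the hash space is the full signed 32-bit range divided evenly among an initial set of shards (purely illustrative; not the branch's actual hashing code):

        public class HashRanges {
          // Split the signed 32-bit hash space into numShards contiguous ranges.
          public static int[][] split(int numShards) {
            int[][] ranges = new int[numShards][2];
            long min = Integer.MIN_VALUE;
            long span = 1L << 32; // total size of the 32-bit space, needs a long
            for (int i = 0; i < numShards; i++) {
              ranges[i][0] = (int) (min + (span * i) / numShards);
              ranges[i][1] = (int) (min + (span * (i + 1)) / numShards - 1);
            }
            return ranges;
          }

          public static void main(String[] args) {
            for (int[] r : split(3)) {
              System.out.println(r[0] + " .. " + r[1]);
            }
          }
        }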

        Mark Miller added a comment -

        Okay, that's more in line with what I was originally thinking. Nothing actually uses the numShards in zk - it's just something that came in with Sami's changes and is floating. In terms of shards advertising ranges, it's certainly something we eventually need to store in zk - it's not necessary until we support adding/removing shards, though. It's easy to add when we need it and doesn't give us anything until we do. And you will still need to determine those ranges based on the number of shards.

        Sami Siren added a comment - edited

        > Nothing actually uses the numShards in zk

        It's used in shard assignment: if the prop does not exist in the collection node, the system property is used. I thought it would be nice to have it in configuration so that you wouldn't need to boot the node that runs the overseer with the system property set. But if there will be a createCollection method somewhere that creates the initial layout, the property is not needed.
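
        A minimal sketch of the fallback just described, with hypothetical names (the real logic lives in the overseer's shard-assignment code): take numShards from the collection node's properties when present, otherwise fall back to the numShards system property.

        import java.util.Map;

        public class ShardCountResolver {
          static int resolveNumShards(Map<String, String> collectionProps) {
            String fromZk = collectionProps == null ? null : collectionProps.get("numShards");
            if (fromZk != null) {
              return Integer.parseInt(fromZk);
            }
            // System property fallback; the default of 3 matches the example layout above and is illustrative.
            return Integer.getInteger("numShards", 3);
          }
        }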

        Mark Miller added a comment -

        Got you - my original plan, until we came up with something better, was that you'd always pass the sysprop and up the number when you want to reshard. Not ideal, but it worked as a start. I put it in solr.xml because I was handling other props that way but didn't intend for it to stay there. Actually forgot it was there. Currently though, always passing a sys prop seems easier than changing it in zk - of course it doesn't address multiple collections served by the same instance. Long term, none of this seems ideal.

        Sami Siren added a comment -

        The attached patch removes the numShards property from the collection node. It also simplifies Overseer by converting it to use ZkStateReader instead of maintaining the read side of the cloudstate internally, and by removing a bunch of unneeded code. I also removed the unneeded IOException from method signatures that I added when I converted the serialization to JSON (they became obsolete when Yonik improved the ser/deser code).

        Sami Siren added a comment -

        The correct patch.

        Mark Miller added a comment -

        Committed. Thanks Sami!

        Mark Miller added a comment -

        Just saw this random fail:

        [junit] Testcase: testOverseerFailure(org.apache.solr.cloud.OverseerTest): Caused an ERROR
        [junit] (null)
        [junit] java.lang.NullPointerException
        [junit] at org.apache.solr.cloud.OverseerTest.waitForSliceCount(OverseerTest.java:142)
        [junit] at org.apache.solr.cloud.OverseerTest.testOverseerFailure(OverseerTest.java:319)
        [junit] at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:528)
        [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
        [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTe

        Mark Miller added a comment -

        Just got another:

        [junit] Testsuite: org.apache.solr.cloud.OverseerTest
        [junit] Testcase: testOverseerFailure(org.apache.solr.cloud.OverseerTest): Caused an ERROR
        [junit] KeeperErrorCode = NoNode for /node_states/node1
        [junit] org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /node_states/node1
        [junit] at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        [junit] at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        [junit] at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1228)
        [junit] at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:242)
        [junit] at org.apache.solr.common.cloud.SolrZkClient.setData(SolrZkClient.java:411)
        [junit] at org.apache.solr.cloud.OverseerTest.testOverseerFailure(OverseerTest.java:316)
        [junit] at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:528)
        [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
        [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)
        [junit] Tests run: 3, Failures: 0, Errors: 1, Time elapsed: 5.118 sec

        Mark Miller added a comment -

        I raised the max iterations timeout for the first fail.

        Sami Siren added a comment -

        This patch:

        - fixes a bug in the overseer where, when two cores were registered close to each other, the edits for the latter would go to a stale cloudState
        - creates the zk nodes the overseer requires in a single method call, before the live node is created
        - removes some old cruft

        Sami Siren added a comment -

        The last patch missed some edits; here's a new one.

        Mark Miller added a comment -

        Thanks Sami - it's in.

        Sami Siren added a comment -

        The attached patch allows the overseer to track shard leaders and update the cloud state accordingly. The shard leader is marked with the prop leader="true" in its cloudstate props.

        Also some of the tests are improved so that they are less likely to fail due to timing issues.
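
        For illustration, a fragment of what a clusterstate.json entry might look like with the leader flag added; the surrounding properties are taken from the earlier dump (with the "_" prefixes already removed), and the exact placement of the flag is an assumption based on the description above:

        {"collection1":{
            "shard2":{"Rogue:7574_solr_":{
                "shard_id":"shard2",
                "leader":"true",
                "collection":"collection1",
                "roles":"",
                "core":"Rogue:7574_solr_",
                "state":"active",
                "node_name":"Rogue:7574_solr",
                "url":"http://Rogue:7574/solr/"}}}}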

        Mark Miller added a comment -

        Patch applied - thanks again.

        Sami Siren added a comment -

        Another patch: got rid of CoreAssignment - ZkController now receives shard ids through the cloudstate, bumped up some counters to get rid of timing problems, and made small improvements in OverseerTest.

        Mark Miller added a comment -

        Doh - I think I missed this last patch and now it doesn't apply at all - can you update it to the branch, Sami?

        Sami Siren added a comment -

        I think I missed this last patch

        It seems like it was applied in r1225550.

        Mark Miller added a comment -

        Ah, good. I was going to check that - usually I comment after I apply, so I thought I must have missed it.

        Mark Miller added a comment -

        I wonder if it makes sense for the overseer to change the state of a node in the clusterstate.json file when a node goes down? It's kind of odd to have it listed as active even when we know a node is out of commission. Should we add another state and start marking nodes as down there?

        Mark Miller added a comment -

        I'm seeing OverseerTest fail quite often now:

            [junit] Testcase: testShardAssignmentBigger(org.apache.solr.cloud.OverseerTest):	FAILED
            [junit] could not find counter for shard:null
            [junit] junit.framework.AssertionFailedError: could not find counter for shard:null
            [junit] 	at org.apache.solr.cloud.OverseerTest.testShardAssignmentBigger(OverseerTest.java:248)
            [junit] 	at org.apache.lucene.util.LuceneTestCase$3$1.evaluate(LuceneTestCase.java:528)
            [junit] 	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:165)
            [junit] 	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)
        
        Mark Miller added a comment -

        I'm seeing OverseerTest fail quite often now:

        I think I have taken care of this - we now wait longer for a leader to show up when registering compared to the short wait we do for updates.

        Mark Miller added a comment -

        Thanks Sami!


  People

  • Assignee: Mark Miller
  • Reporter: Mark Miller
  • Votes: 1
  • Watchers: 5
