
SOLR-3488: Create a Collections API for SolrCloud

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0, 6.0
    • Component/s: SolrCloud
    • Labels: None

      Attachments

    1. SOLR-3488_2.patch (31 kB) - Tommaso Teofili
    2. SOLR-3488.patch (51 kB) - Mark Miller
    3. SOLR-3488.patch (44 kB) - Mark Miller
    4. SOLR-3488.patch (21 kB) - Mark Miller


        Activity

        Mark Miller added a comment - edited

        I'll post an initial patch just for create soon. It's just a start though. I've added a bunch of comments for TODOs or things to consider for the future. I'd like to start simple just to get 'something' in though.

        So initially, you can create a new collection and pass an existing collection name to determine which shards it's created on. Would also be nice to be able to explicitly pass the shard urls to use, as well as simply offer X shards, Y replicas. In that case, perhaps the leader overseer could handle ensuring that. You might also want to be able to simply say, create it on all known shards.

        Further things to look at:

        • Other commands, like remove/delete.
        • What to do when some create calls fail? Should we instead add a create node to a queue in ZooKeeper, and make the overseer responsible for checking for any jobs there, completing them (if needed), and then removing the job from the queue? Other ideas?
        Sami Siren added a comment -

        should we instead add a create node to a queue in zookeeper? Make the overseer responsible for checking for any jobs there, completing them (if needed) and then removing the job from the queue?

        I like this idea. I would also refactor the current ZkController->Overseer communication to use this same technique.

        Yonik Seeley added a comment - edited

        should we instead add a create node to a queue in zookeeper?

        Yeah, a work queue in ZK makes perfect sense. Perhaps serialize the params to a JSON map/object per line?
        edit: or perhaps it makes more sense for each operation to be a separate file (which is what I think you wrote anyway)

        Possible parameters:

        • name of the collection
        • the config for the collection
        • number of shards in the new collection
        • default replication factor

        Operations:

        • add a collection
        • remove a collection
          • different options here... leave cores up, bring cores down, completely remove cores (and data)
        • change collection properties (replication factor, config)
        • expand collection (split shards)
        • add/remove a collection alias

        Shard operations:

        • add a shard (more for custom sharding)
        • remove a shard
        • change shard properties (replication factor)
        • split a shard
        • add/remove a shard alias
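
        To make the queue idea concrete, here is a minimal sketch - not from any patch on this issue - of how a create operation could be enqueued as a sequential ZooKeeper node carrying one JSON object. The queue path and the JSON field names are only illustrative; they roughly mirror the parameters and operations listed above.

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        public class EnqueueCollectionJob {
          public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });
            // One JSON object per operation; the field names here are illustrative, not a fixed schema.
            String job = "{\"operation\":\"createcollection\",\"name\":\"demo\","
                       + "\"config\":\"myconf\",\"numShards\":2,\"replicationFactor\":3}";
            // Sequential znodes give FIFO ordering; the overseer would consume the lowest sequence first.
            // Assumes the parent queue path already exists.
            zk.create("/overseer/collection-queue-work/qn-", job.getBytes("UTF-8"),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
            zk.close();
          }
        }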
        Mark Miller added a comment -

        I'm going on vacation for a week, so here is my early work on just getting something basic going. It does not involve any overseer stuff yet.

        Someone feel free to take it - commit it and iterate, or iterate in patch form - whatever makes sense. I'll help when I get back if there is more to do, and if no one makes any progress, I'll continue on it when I get back.

        Currently, I've copied the core admin handler pattern and made a collections handler. There is one simple test and currently the only way to choose which nodes the collection is put on is to give an existing template collection.

        The test asserts nothing at the moment - all very early work. But I imagine we will be changing direction a fair amount, so that's good I think.

        Tommaso Teofili added a comment -

        Slight improvements to Mark's patch.
        Regarding the template-based creation, I think it should use a different parameter name for the collection template (e.g. "template") and use the "collection" parameter for the new collection name.
        Apart from that, I think it may be useful to clearly define different creation strategies (I've created an interface for that); the right one is chosen on the basis of the passed HTTP parameters.
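
        Purely as an illustration of that idea - the interface in the attached patch may well look different - a creation strategy could be as small as:

        import java.util.List;
        import org.apache.solr.common.params.SolrParams;

        // Hypothetical sketch only; the names are not taken from the attached patch.
        interface CollectionCreationStrategy {
          // e.g. return true when a "template" param is present in the request
          boolean accepts(SolrParams params);
          // decide which live nodes should host cores for the new collection
          List<String> pickNodes(SolrParams params, List<String> liveNodes);
        }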

        Mark Miller added a comment -

        Thanks Tommaso!

        Regarding the template based creation I think it should use a different parameter name for the collection template (e.g. "template") and use the "collection" parameter for the new collection name.

        I'm actually hoping that perhaps that stuff is temporary, and I just did it to have something that works now. I think though, that we should really change how things work - so that you just pass the number of shards and the number of replicas, and the overseer just ensures the collection is on the right number of nodes. Then we don't have to have this 'template' collection to figure out what nodes to create on - or explicitly pass the nodes.

        Sami has a distributed work queue for the overseer setup now, and I'm working on integrating this with that.

        Tommaso Teofili added a comment -

        I think though, that we should really change how things work - so that you just pass the number of shards and the number of replicas, and the overseer just ensures the collection is on the right number of nodes. Then we don't have to have this 'template' collection to figure out what nodes to create on - or explicitly pass the nodes.

        sure, +1.

        Sami has a distributed work queue for the overseer setup now, and I'm working on integrating this with that.

        That looks great. By the way, I think it would be good if that could also (optionally) be used for indexing in SolrCloud.

        Mark Miller added a comment -

        I've got my first patch ready - still some things to address, but it currently does queue based collection creation.

        One thing I recently realized when I put some last minute pieces together is that I cannot share the same Overseer queue that already exists - it will cause a deadlock as we wait for states to be registered. Processing the queue with multiple threads still seemed scary if you were to create a lot of collections at once - so it seems just safer to use a different queue.

        I'm still somewhat unsure about handling failures though - for the moment I'm simply adding the job back onto the queue - this gets complicated quickly though. Especially if you add in delete collection and it can fail. If you start putting commands back on the queue you could have weird create/delete command retry reordering?

        I also have not switched to requiring or respecting a replication factor - I was thinking perhaps specifying nothing or -1 would give you what you have now? An infinite rep factor? And we would enforce a lower rep factor if requested? For now I still require that you pass a collection template, and the new collection is created on the nodes that host the collection template.

        I'm not sure how replication factor would be enforced though? The Overseer just periodically prunes and adds given what it sees and what the rep factor is? Is that how failures should be handled? Don't re-add to the queue, just let the periodic job attempt to fix things later?

        What if someone starts a new node with new replicas preconfigured in solr.xml? The Overseer periodic job would simply remove them shortly thereafter if the rep factor was not high enough?

        One issue with pruning at the moment is that unloading a core will not remove its data dir. We probably want to fix that for collection removal.

        If we go too far down this path, it seems rebalancing starts to become important as well.

        I've got some other thoughts and ideas to get down, but that is a start so I can gather some feedback.

        I have not yet integrated Tommaso's work, but will if we don't end up changing things much from now.

        Yonik Seeley added a comment -

        One thing I recently realized when I put some last minute pieces together is that I cannot share the same Overseer queue that already exists - it will cause a deadlock as we wait for states to be registered. Processing the queue with multiple threads still seemed scary if you were to create a lot of collections at once - so it seems just safer to use a different queue.

        I'm sure you guys have thought about the command queue more than me at this point... but some brainstorming off the top of my head:

        • The type of request could implicitly be synchronous (must complete before moving to the next request) or asynchronous
        • even for an asynchronous command, the executor for certain types of commands could be limited to a single thread to avoid complexity (not sure if that helps your deadlock situation or not)

        Telling when something is done:

        • could have a flag that requests the command be put on a completed queue, along with any completion info.
        • to prevent unbounded growth of a completed queue, it could have a limited size (it could be pretty useful to see a recent history of operations)

        The work queue thing could be a public interface (i.e. external management systems could directly make use of it), in which case we'd want to document it well eventually.

        Mark Miller added a comment -

        Updated patch - some refactoring and started adding remove-collection code - though currently we do not remove all collection info from zk even when you unload every shard - something we should probably start doing?

        Tommaso Teofili added a comment -

        currently we do not remove all collection info from zk even when you unload every shard - something we should probably start doing?

        +1

        I have not yet integrated Tomasso's work, but will if we don't end up changing things much from now.

        It should be easy to merge, but I think it'd also be good to start committing your patch and improve things on SVN from now on, to ease code review (no patch merging) and concurrent work.

        Sami Siren added a comment -

        Mark, nice work.

        I'm still somewhat unsure about handing failures though...

        IMO fail fast: at minimum an error should be reported back (the completed queue Yonik mentions?). It seems that in the latest patch, even in case of failure the job is removed from the queue.

        I also have not switched to requiring or respecting a replication factor - I was thinking perhaps specifying nothing or -1 would give you what you have now? An infinite rep factor? And we would enforce a lower rep factor if requested?

        Sounds good to me.

        I'm not sure how replication factor would be enforced though? The Oveerseer just periodically prunes and adds given what it sees and what the rep factor is? Is that how failures should be handled? Don't readd to the queue, just let the periodic job attempt to fix things later?

        I would first implement the simplest case: if not enough nodes are available to meet #shards and/or the replication factor, report an error to the user and do not try to create the collection. Or did you mean at runtime, after the collection has been created?

        I have one question about the patch specifically in the OverseerCollectionProcessor where you create the collection: why do you need the collection param?
        In the context of creating an N * R cluster: why don't you just go through the live nodes to find available nodes and perhaps then, based on some "strategy" class, create specific shards (with shard ids) on specific nodes? The rest of the overseer would have to respect that same strategy (instead of the dummy AssignShard that is now used) so that things would not break when new nodes are attached to the collection. Perhaps this "strategy" could also handle things like time-based sharding and whatnot?

        it should be easy to merge but I think that it'd be also good to start committing your patch and improve things on SVN from now on to ease code review (no patch merging) and concurrent works.

        +1 for committing this as is. There are some minor weak spots in the current patch, like checking the input for the collections API requests (nonexistent params cause OverseerCollectionProcessor to die with an NPE), reporting back input errors, etc., but let's just put this in and open more JIRA issues to cover the improvement tasks and bugs?

        One more thing: I am seeing BasicDistributedZkTest failing (not just sporadically), not sure if it is related, with the following error:

         [junit4] ERROR   0.00s J1 | BasicDistributedZkTest (suite)
           [junit4]    > Throwable #1: java.lang.AssertionError: ERROR: SolrIndexSearcher opens=496 closes=494
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([F1C0A91EB78BAB39]:0)
           [junit4]    > 	at org.junit.Assert.fail(Assert.java:93)
           [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.endTrackingSearchers(SolrTestCaseJ4.java:190)
        
        Mark Miller added a comment -

        It seems that in the latest patch even in case of failure the job is removed from queue.

        Right - I was putting it back on the queue, but once I added deletes, I removed that because I was worried about reorderings. I figure we may need a different strategy in general. I'll expand on that in a new comment.

        report error to user and do not try to create the collection

        Yeah, that is one option - then we have to remove the collection on the other nodes though. For instance, what happens if one of the create-core calls fails due to an intermittent connection error? Do we fail then? We would need to clean up first. Then what if one of those nodes fails before we could remove it, and then comes back with that core later? I agree that simple might be the best bet to start, but in failure scenarios it gets a little muddy quickly. Which may be fine to start, as you suggest.

        I have one question about the patch specifically in the OverseerCollectionProcessor where you create the collection: why do you need the collection param?

        Mostly just simplicity to start - getting the nodes based on a template collection was easy. Tommaso did some work on extracting a strategy class, but I have not yet integrated it. Certainly we need more options at a minimum, or perhaps just a different strategy. Simplest might be a way to go, but it also might be a back compat problem if we choose to do something else. I'll try and elaborate in a new comment a bit later today.

        and improve things on SVN from now

        Okay, that sounds fine to me. I'll try and polish the patch a smidgen and commit it as a start soon.

        Mark Miller added a comment -

        Perhaps it's a little too ambitious, but the reason I brought up the idea of the overseer handling collection management every n seconds is:

        Let's say you have 4 nodes with 2 collections on them. You want each collection to use as many nodes as are available. Now you want to add a new node. To get it to participate in the existing collections, you have to configure them, or create new compatible cores over HTTP on the new node. Wouldn't it be nice if the Overseer just saw the new node, saw that the collections had repFactor=MAX_INT, and created the cores for you?

        Also, consider failure scenarios:

        If you remove a collection, what happens when a node that was down comes back and had a piece of that collection? Your collection will be back as a single node. An Overseer process could prune this off shortly after.

        So numShards/repFactor + Overseer smarts seems simple and good to me. But sometimes you may want to be precise in picking shards/replicas. Perhaps simply doing some kind of 'rack awareness' type feature down the road is the best way to control this though. You could create connections and weight costs using token markers for each node or something.

        So I think maybe we would need a new zk node where Solr instances register rather than cores? Then we know what is available to place replicas on - even if that Solr instance has no cores.

        Then the Overseer would have a process that ran every n (1 min?) and looked at each collection and its repFactor and numShards, and added or pruned given the current state.

        This would also account for failures on collection creation or deletion. If a node was down and missed the operation, when it came back, within N seconds, the Overseer would add or prune with the restored node.

        It handles a lot of failure scenarios (with some lag) and makes the interface to the user a lot simpler. Adding nodes can eventually mean just starting up a new node rather than requiring any config. It's also easy to deal with changing the replication factor. Just update it in zk, and when the Overseer process runs next, it will add and prune to match the latest value (given the number of nodes available).
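
        A rough sketch of what that periodic reconcile pass might look like (all names below are hypothetical; nothing here is from a patch on this issue):

        import java.util.Collections;
        import java.util.Map;
        import java.util.Set;

        public class OverseerReconcileSketch {
          // desired: collection -> target replication factor (Integer.MAX_VALUE means "use every live node")
          // current: collection -> nodes that currently host a replica of it
          static void reconcile(Map<String, Integer> desired,
                                Map<String, Set<String>> current,
                                Set<String> liveNodes) {
            for (Map.Entry<String, Integer> e : desired.entrySet()) {
              String collection = e.getKey();
              int want = Math.min(e.getValue(), liveNodes.size());
              Set<String> have = current.getOrDefault(collection, Collections.<String>emptySet());
              if (have.size() < want) {
                // create cores on live nodes that do not yet host this collection
              } else if (have.size() > want) {
                // prune excess replicas, e.g. ones that reappeared after a delete
              }
            }
          }
        }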

        Tommaso Teofili added a comment -

        Lets say you have 4 nodes with 2 collections on them. You want each collection to use as many nodes as are available. Now you want to add a new node. To get it to participate in the existing collections, you have to configure them, or create new compatible cores over http on the new node. Wouldn't it be nice if the Overseer just saw the new node, that the collections had repFactor=MAX_INT and created the cores for you?

        Sure, that'd be nice indeed. Maybe this should be configurable (a param like greedy=boolean).

        If you remove a collection, what happens when a node that was down comes back and had that a piece of that collection? Your collection will be back as a single node. An Overseer process could prune this off shortly after.

        So basically, if I understood correctly, the overseer would have the capability of doing periodic checks without an explicit action/request from a client, which can help with cleaning state, checking for failures, etc.

        So I think maybe we would need a new zk node where solr instances register rather than cores? then we know what is available to place replicas on - even if that Solr instance has no cores?

        Wouldn't it be possible to store the instance information in the same zk node? Because otherwise we would also have to check that the two nodes are in consistent states (I don't know ZooKeeper very well, so I may be wrong here).

        Mark Miller added a comment -

        so basically, if I understood correctly, is that the overseer has the capability of doing periodic checks without an explicit action / request from a client which can help on cleaning states / check for failures / etc.

        Yeah - basically, either every n seconds, or when the overseer sees a new node come or go, it looks at each collection, checks its replication factor, and either adds or removes nodes to match it given the nodes that are currently up. So with some lag, whatever you set for the replication will eventually be matched no matter the failures or random state of the cluster when the collection is created or its replication factor changed.

        Sami Siren added a comment -

        Yeah - basically, either every n seconds, or when the overseer sees a new node come or go, it looks at each collection, checks its replication factor, and either adds or removes nodes to match it given the nodes that are currently up. So with some lag, whatever you set for the replication will eventually be matched no matter the failures or random state of the cluster when the collection is created or its replication factor changed.

        That sounds like a good goal. I think we need to have special handling for the situation where the whole cluster/collection is bounced or some planned maintenance is to be done.

        HDFS has a feature called safe mode that is enabled on startup (and can be turned on at any point); while in that mode, replication of blocks is prohibited. When a certain percentage of blocks are available, it moves out of this mode. Something like that might work in the Solr context also - meaning no shard reorganization would happen until a certain percentage of the shards are available, or Solr is specifically told to leave this mode.

        http://hadoop.apache.org/hdfs/docs/current/hdfs_user_guide.html#Safemode

        Mark Miller added a comment -

        To get something incrementally committable I'm changing from using a collection template to a simple numReplicas. I have hit an annoying stall where it is difficult to get all of the node host URLs. The live_nodes list is translated from URL to a path-safe form. It's not reversible if _ is in the original URL. You can put the URL in the data at each node, but then you have to slowly read each node rather than making a simple getChildren call. You can also try to find every node by running through the whole JSON cluster state file - but that wouldn't give you any nodes that have no cores on them at the moment (say, after a collection delete).
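
        To illustrate the ambiguity (the exact encoding is assumed here, not quoted from the code): a live_nodes entry like host1:8983_solr can only have come from host1:8983/solr, but host1:8983_solr_extra could have come from either host1:8983/solr/extra or host1:8983/solr_extra, so the original URL cannot be recovered once _ appears in it.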

        Mark Miller added a comment -

        Re the above: I'm tempted to add another data node that just has the list of nodes? I think it would be good to have an efficient way to get that list. It's a pain with clusterstate.json, and that loses nodes with no cores on them now.

        Something I just remembered I have to look into: the default location of the data dir for on-the-fly cores that are created is probably not great.

        Mark Miller added a comment -

        I'm about ready to commit a basic first draft so it is easier for others to contribute.

        One of the main limitations is that it picks servers randomly to create a new collection - we probably want to change that before 4.0 so that it chooses servers based on the number of cores they already have, or something.

        Whether we allow different creation strategies that let you specify which servers to choose in alternate ways still seems up for debate - but that is also easy to add in if we choose to go that route.

        Mark Miller added a comment -

        Before I commit I have to track down an occasional failure where an index directory fails to be created.

        Mark Miller added a comment -

        Looks like the issue is the data dir for created collections in tests - because each jetty shares the same solr home, dynamically created cores from the collections api share data dirs (bad). Other tests specify appropriate data dirs, but the collections api does not provide an easy way to specify different data dirs for each collection on each node that it will create...

        Thinking...

        Mark Miller added a comment -

        So I refactored some of the zk distrib tests to use a new solrhome copy for each jetty rather than simply a different data directory. Trying to address this with the data directory seemed like a non-starter across multiple jetty instances. With a unique Solr home though, creating new cores dynamically with the default data dir will no longer step on each other.

        Doing some last minute testing, but I'll commit what I have to trunk shortly.

        Hoss Man added a comment -

        Bulk fixing the version info for 4.0-ALPHA and 4.0; all affected issues have "hoss20120711-bulk-40-change" in a comment.

        Mark Miller added a comment -

        I'm going to add a collection RELOAD command, and beef up the tests a little. Still more to do after that.

        Markus Jelsma added a comment -

        Is it intended for a collection RELOAD action to reload all collection cores immediately? That implies downtime, I assume?

        Mark Miller added a comment -

        Is it intended for a collection RELOAD action to reload all collection cores immediately?

        Yes, at least initially. Essentially a convenience method for reloading your cores to pick up changed config or settings. There may be other ways we allow that to happen more automatically eventually, but at a minimum we need the ability to trigger a collection-wide reload. There are things to consider for a truly massive cluster - do you really want every node trying to read the new configs from zk at the same time? That's in the future if I end up working on it. We'd have to see how many servers it takes before you end up with a problem, if it is indeed a problem at all.

        That implies downtime i assume?

        I'm not sure why? Core reloads don't involve any downtime.

        Markus Jelsma added a comment -

        Thanks for clarifying, it makes sense. About the downtime on core reload: a load balancer pinging Solr's admin/ping handler will definitely mark the node as down; the ping request will time out for up to a few seconds or even longer in case of many firstSearcher events.

        Yonik Seeley added a comment -

        the ping request will time out for up to a few seconds or even longer in case of many firstSearcher events.

        Hmmm, perhaps we could add an option for the new core to have a registered searcher before we swap it in and close the old core?

        Mark Miller added a comment -

        Yeah, this sounds like something we have to fix to me. There should not be a gap in serving requests on core reload.

        Yonik Seeley added a comment -

        There should not be a gap in serving requests on core reload.

        Just to clarify: it's more a practical gap than a real gap... it should be impossible for a query to not be serviced - it's just that a cold core could take longer to service the query than desired. But it should be pretty easy to allow waiting for that searcher in the new core.

        Mark Miller added a comment -

        Ah, did not catch it was just a timeout issue. Was wondering what the problem could be.

        Yeah, not as bad as I thought then. An option would be nice.

        Mark Miller added a comment -

        Committing improved tests and the reload command in a moment.

        Another thing we will need before too long is a way to get a response, I think? Right now, the client can't learn of the success or failure of the command. It's just in the Overseer's logs.

        To get notified, I suppose the call would have to block and then get a result from the overseer.

        I suppose that could be done by something like: create a new ephemeral node for each job - the client watches the node - when the overseer is done, it sets the result as data on the node - the client gets a watch notification and reads the result? Then how to clean up? Not sure about the idea overall, brainstorming... I don't see a simple way to have the overseer do the work in an async fashion and have the client easily get the results of that.
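
        One possible shape for that - sketched below with made-up paths and no error handling - is a client that blocks on a ZooKeeper watch until the overseer writes the result onto a per-job response node:

        import java.util.concurrent.CountDownLatch;
        import org.apache.zookeeper.WatchedEvent;
        import org.apache.zookeeper.Watcher;
        import org.apache.zookeeper.ZooKeeper;

        public class WaitForJobResult {
          public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });
            String responseNode = "/overseer/collection-queue-work/qnr-0000000001"; // hypothetical path
            final CountDownLatch done = new CountDownLatch(1);
            // One-shot watch: fires when the overseer sets the job result as the node's data.
            zk.exists(responseNode, new Watcher() {
              public void process(WatchedEvent event) {
                if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                  done.countDown();
                }
              }
            });
            done.await();
            byte[] result = zk.getData(responseNode, false, null);
            System.out.println(new String(result, "UTF-8"));
            zk.delete(responseNode, -1); // clean up once the result has been read
            zk.close();
          }
        }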

        Robert Muir added a comment -

        rmuir20120906-bulk-40-change

        Lance Norskog added a comment -

        Yeah, a work queue in ZK makes perfect sense.

        http://zookeeper-user.578899.n2.nabble.com/Announcing-KeptCollections-distributed-Java-Collections-for-ZooKeeper-td5816709.html

        https://github.com/anthonyu/KeptCollections

        Distributed Java Collections implementations. Apache licensed. Years of use.

        Jan Høydahl added a comment -

        I've been playing with the collections API lately.

        First, I am more in favor of the approach where all config and config changes are done against ZK. I do not like the idea of having to start a solr node in order to define a new collection or change various configs. All initial config as well as config changes should be possible to check in to source control and roll out to a cluster as files without starting and stopping live nodes (perhaps except ZK itself).

        Then regarding the current collections API. I don't know if it is by design or a bug:

        1. Start a master, bootstrapping collection1 from local config
        2. Create another collection, specifying the number of replicas, using the collections API: http://localhost:8983/solr/admin/collections?action=CREATE&name=demo&numShards=1&numReplicas=5
        3. Add more Solr nodes to the cluster
        4. The original collection "collection1" starts using the extra nodes as replicas, but the new "demo" collection does not
        Mark Miller added a comment -

        I don't know if it is by design or a bug:

        For the moment, it's by design. "collection1" would only show up on the new nodes because you have it preconfigured in solr.xml. Unless you take that out, it will be on any node you ever start. That's just the result of how we currently ship Solr and the user not making any changes.

        We do plan on making it so that your replication setting for a collection is adjusted over time - so if you only have 2 replicas now, but create the collection with a replication factor of 5, when you add 3 nodes, we'd like the overseer to recognize this and create the new replicas.

        Yonik Seeley added a comment -

        but create the collection with a replication factor of 5,

        Which reminds me... is "replicationFactor" a better name than "numReplicas" that we have now? If so we should change at the same time we make the changes for the slice properties. It seems like replicationFactor better expresses that it's a target and not the actual number of replicas. This will also make things less confusing when we allow the replication factor to be overridden on a per-shard basis. Imagine seeing "numReplicas=3" as a shard property but then seeing only one replica listed under that shard!

        Mark Miller added a comment -

        is "replicationFactor" a better name than "numReplicas"

        +1

        Mark Miller added a comment -

        SOLR-3845: Rename numReplicas to replicationFactor in Collections API.

        Mark Miller added a comment -

        First, I am more in favor of the approach where all config and config changes are done against ZK. I do not like the idea of having to start a solr node in order to define a new collection or change various configs. All initial config as well as config changes should be possible to check in to source control and roll out to a cluster as files without starting and stopping live nodes (perhaps except ZK itself).

        I'm not sure I follow that... why would you need to start a Solr node to deal with collections? The collection manager is designed the same way as core admin - you should not need a core...

        Jan Høydahl added a comment -

        I'm not sure I follow that...why would you need to start a solr node to deal with collections? The collection manager is designed the same way as core admin - you should not need a core...

        So how to create a new collection offline? Push an updated solr.xml to ZK?

        Mark Miller added a comment -

        Yes, if you want to predefine a collection you do it in solr.xml, the same way that collection1 is done. Otherwise, you can start Solr with no collections and create them with the API.

        Mark Miller added a comment -

        Further work should be new issues.

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee: Mark Miller
          • Reporter: Mark Miller
          • Votes: 3
          • Watchers: 10
