Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: SolrCloud, update
    • Labels: None

      Description

      Need states for shards that indicate whether they are recovering, active/enabled, or disabled.

      Attachments

      1. cluster_state-file.patch
        57 kB
        Jamie Johnson
      2. combined.patch
        56 kB
        Jamie Johnson
      3. incremental_update.patch
        46 kB
        Jamie Johnson
      4. scheduled_executors.patch
        50 kB
        Jamie Johnson
      5. shard-roles.patch
        4 kB
        Jamie Johnson
      6. solrcloud.patch
        47 kB
        Jamie Johnson
      7. solrcloud.patch
        62 kB
        Jamie Johnson
      8. solrcloud.patch
        110 kB
        Jamie Johnson

        Issue Links

          Activity

          Yonik Seeley added a comment -
          • we probably want states on a per shard basis (in case we go with micro-sharding, a node may have multiple shards in different states).
          • we might want a state on the node also... a way to mark it as "disabled" in general (note to rest of cluster - consider the node to be down)
          • an active/enabled shard should be preferred as a leader

          Perhaps at the same time think of adding "roles" to nodes: a comma-separated list with some pre-defined values, but where the user may also choose to define their own. One example use case would be to have a bank of indexers for rich text (PDF, Word, etc.) that do all the work of text extraction or other expensive processing and forward the results to the right leader. This could also be used to remove all search traffic from a node (by removing the standard "searcher" role) but allow it to stay up to date by remaining in the indexing loop.

          Jamie Johnson added a comment -

          Yonik,

          I needed this role capability so I could dynamically add/remove/discover Solr instances and their responsibilities as the state of the cloud changed. To do this I modified ZkController.java, CoreContainer.java and CloudDescriptor.java to incorporate this information. Now in solr.xml you define the following:

          <core name="coreName" instanceDir="." shard="shard1" collection="collection" roles="searcher,indexer"/>

          I've attached the patch for comment (it wasn't done against trunk, but I can try to pull that down and redo it there if necessary).
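          As a minimal, hypothetical sketch (illustrative only, not the attached patch), the roles attribute could be carried on the cloud descriptor roughly like this; the class and method names below are placeholders:

          import java.util.Arrays;
          import java.util.Collections;
          import java.util.LinkedHashSet;
          import java.util.Set;

          // Hypothetical sketch of carrying a roles attribute from solr.xml on a
          // cloud descriptor; names are illustrative, not the actual patch code.
          public class CloudDescriptorRoles {
            private Set<String> roles = Collections.emptySet();

            // Called with the raw attribute value from solr.xml, e.g. "searcher,indexer"
            public void setRoles(String rolesAttr) {
              roles = new LinkedHashSet<String>(Arrays.asList(rolesAttr.split("\\s*,\\s*")));
            }

            public Set<String> getRoles() {
              return roles;
            }
          }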

          Mark Miller added a comment - edited

          This is where incremental update of the cloud state gets tricky...

          If you have something like these roles at the shard level, all of a sudden you cannot change them on the fly because the new incremental update will not pick them up.

          It's a tricky situation - without incremental, things start to get nasty at a huge number of shards. One possibility is that everyone also watches another node, that when pinged, causes a full read - so that most cloud state updates are incremental, but when per shard info like this is changed on the fly, you can then trigger a full read by everyone...

          Jamie Johnson added a comment -

          Yeah, 100% agree. The current implementation of the update doesn't check whether the data in the node changed; you'd need a watcher on each node to do that. The other project that I'm working on does just that. We create a watcher on /live_nodes to track the list of available servers, a watch on the collection to see if a slice was added/removed, a watcher on each slice (not sure if that is the correct terminology) to check if a shard is added/removed, and subsequently a watcher on each shard to track data changes. So lots of watchers all around.

          Would it be easier to store this information on the ephemeral nodes (under live_nodes)? Then we only need a watcher for live_nodes (add/remove) and a watcher for each shard under live_nodes to see if their data changed. I'm not sure what else is using the collection hierarchy (just query?), but perhaps this would be a bit simpler.
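          For illustration, a rough sketch (assuming the plain ZooKeeper Java client) of the "watch live_nodes plus a data watch per child" shape described above; paths and error handling are simplified, and this is not the code from either project:

          import java.util.List;

          import org.apache.zookeeper.WatchedEvent;
          import org.apache.zookeeper.Watcher;
          import org.apache.zookeeper.Watcher.Event.EventType;
          import org.apache.zookeeper.ZooKeeper;

          // Sketch: one child watch on /live_nodes for membership changes, plus a
          // data watch per child to notice configuration/role changes.
          public class LiveNodesWatcher implements Watcher {
            private final ZooKeeper zk;

            public LiveNodesWatcher(ZooKeeper zk) throws Exception {
              this.zk = zk;
              watchLiveNodes();
            }

            private void watchLiveNodes() throws Exception {
              // getChildren/getData set one-shot watches; they must be re-registered
              // every time an event fires.
              List<String> nodes = zk.getChildren("/live_nodes", this);
              for (String node : nodes) {
                zk.getData("/live_nodes/" + node, this, null);
              }
            }

            public void process(WatchedEvent event) {
              try {
                if (event.getType() == EventType.NodeChildrenChanged
                    || event.getType() == EventType.NodeDataChanged) {
                  watchLiveNodes(); // refresh the local view and re-register watches
                }
              } catch (Exception e) {
                // real code would handle connection loss / session expiry here
              }
            }
          }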

          Mark Miller added a comment -

          hmmm...you know forever I have been trying to avoid juggling potentially 100's to 1000's of watchers per node like that. Perhaps it really is the lesser of all evils though? This whole situation is why I initially avoided incremental update and it only went in recently. How many zk nodes do you tend to see in that project with lots of watchers?

          The problem with live nodes is that it's per Solr instance, and some of this info will probably want to be per core? If we made a new structure that put cores/shards under a Solr instance node, I guess that could reduce the number of watches needed - but I wonder if it's worth the effort - I'd almost rather see the full watcher option on the current node structure fall down first...

          Jamie Johnson added a comment - edited

          I haven't tested my current implementation at scale yet (I've currently tested it with 200 slices, each with 2 shards, but only had 1 client listening for these changes) and it ran fine. I will try to put together a patch which does the same thing for Solr on the current structure, but no guarantee I'll get to it today.

          Mark Miller added a comment -

          Yeah, my worry is every zk node listening for changes on every other zk node ... a 'small' n^2 explosion of watches...

          Jamie Johnson added a comment - edited

          I've attached my first hack at this. From what I can tell it works as I expect. I touched a few other files, since forcing an update doesn't make sense anymore now that we have a watcher on that node and update it whenever a change occurs. I'm also updating the CloudState object in a way that it wasn't intended (I had to remove the code that made the collections unmodifiable); we could probably put some setters on CloudState so that other classes don't have to step through the collections in a way that is potentially harmful (since I'm currently returning the live lists), but for now I think this is a good first step.

          As always, comments are welcome.

          Ted Dunning added a comment -

          I don't think that you need all nodes to watch all nodes.

          What you need is a directory of assigned states that a central overseer sets. Each node will watch its assignment to see which shards to serve and what state to assume.

          Then you need the live_nodes directory where all nodes can advertise an ephemeral showing their current status, including which shards are served and any other information. Since this is SOLR, XML is a natural format for that.

          The only watcher on the live_nodes directory and the node files in that directory is the overseer.

          If a node dies, ZK will send notifications to all live query connections and the overseer.

          It is also nice to have the current serving state for query connections to be in a file or in directory form. This file could be updated (atomically) by nodes as they start or stop serving shards, but would have to be updated by the overseer on node failure. Using a per shard directory allows the list of nodes serving a shard to be handled as ephemerals, but it is a bit less desirable because zknode names have to contain node references which is an example of name/content confusion. Since atomic update is easy in ZK, the file implementation is probably better.

          The update to current shard state will ultimately cause a notification to be sent to each live query connection.
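          A minimal sketch of the ephemeral advertisement under live_nodes described above, assuming the plain ZooKeeper client; the path and payload here are illustrative:

          import org.apache.zookeeper.CreateMode;
          import org.apache.zookeeper.ZooDefs;
          import org.apache.zookeeper.ZooKeeper;

          // Sketch: a node advertises itself with an ephemeral znode under /live_nodes.
          // If the node's ZooKeeper session ends, the advertisement disappears with it.
          public class NodeAdvertiser {
            public static void advertise(ZooKeeper zk, String nodeName, byte[] statusDoc)
                throws Exception {
              zk.create("/live_nodes/" + nodeName,
                        statusDoc,                   // e.g. a small XML status document
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, // no ACLs in this sketch
                        CreateMode.EPHEMERAL);       // removed automatically on session loss
            }
          }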

          Jamie Johnson added a comment -

          Hmm... interesting idea, but what happens if one shard changes its role? We'd update this central overseer and it would need to pull the entire cloud state again? I guess I am not quite following this.

          Ted Dunning added a comment -

          There are several kinds of shard state changes. As I see it, all of them resolve into two situations. One is "more replicas are needed" and the other is "fewer replicas are needed". In the former case, the overseer will assign one or more nodes additional shards to serve and the nodes will take it from there, eventually advertising that they handle those shards. In the latter case, the overseer will edit the assignments for one or more nodes reducing their duties. The nodes will notice that and will close the corresponding indexes.

          It is possible that you want to have a loaded-but-not-available-for-searching state for shards. In that case, what you really have is two nearly orthogonal specifications, one for the nodes and the other for the clients. The node assignments are just as above, but the client specification determines which nodes the client is allowed to ask for which shards/collections.

          Does that answer your question?

          Jamie Johnson added a comment -

          That answers part of it; I am trying to consider this with regard to the project I am currently working on. On this project we also have the case where we are interested in additional slices/shards becoming available and serving more data. So in this case it's not a question of replicas but of completely new slices.

          We also distribute our queries (much like the latest solrj on trunk does), where we randomly choose a server with the role "searcher". I think this means that each searcher needs to be aware of all of the other available servers with the role "searcher" to be able to execute the query. I suppose the servers with the role "indexer" do not need to build the watchers, as they are not being used in the query process (assuming they aren't also searchers).

          I am just not sure that we can guarantee that the servers won't need to be aware of all other servers. Someone more familiar with the workings of how queries are distributed may have a better idea in this regard.

          All of that being said it may be simpler (as mentioned previously) to create the watchers on the live nodes and move the information that is in the collection structure there. We could then create a watcher on the directory (node is available or removed) and the ephemeral nodes (shard configuration changed) and go on with life. This still gives us an n^2 type of setup though, but in this regard I defer to Mark as he is much more familiar with the inner workings of solr and solr cloud than I. This may also be something that we'd want to reach out to our Zookeeper brethren to get their input on how something like this could be managed and if this is destined to fail as n increases.

          Ted Dunning added a comment -

          That answers part of it; I am trying to consider this with regard to the project I am currently working on. On this project we also have the case where we are interested in additional slices/shards becoming available and serving more data. So in this case it's not a question of replicas but of completely new slices.

          This is just a case of slices appearing that are not yet replicated. It should be no different than if all nodes handling these were to die simultaneously, really.

          We also distribute our queries (much like the latest solrj on trunk does), where we randomly choose a server with the role "searcher". I think this means that each searcher needs to be aware of all of the other available servers with the role "searcher" to be able to execute the query. I suppose the servers with the role "indexer" do not need to build the watchers, as they are not being used in the query process (assuming they aren't also searchers).

          I don't follow this at all. Why would servers need to know about other servers?

          This wouldn't be quite n^2 in any case. It could be order n for each change, but the constant factor would be quite small for reasonable clusters. The cost would be n^2/2 as the cluster came up, but even for 1000 nodes, this would clear in less than a minute which is much less time than it would take to actually bring the servers up.

          I still don't think that there would be any need for all servers to know about this. The clients need to know, but not the servers.

          If you mean that a searcher serves as a query proxy between the clients and the servers, then you would require a notification to each search for each change. If you have k searchers and n nodes, bringing up the cluster would require k n / 2 notifications. For 100 proxies and 1000 search nodes, this is a few seconds of ZK work. Again, this is much less than the time required to bring up so many nodes.

          If you put server status into a single file, far fewer notifications would actually be sent because as the notifications are delayed, the watchers get delayed in being reset so you have natural quenching while still staying very nearly up to date.

          Regarding putting the information about the collections available into the live nodes, I think that would be inefficient for the clients compared to putting it into a summarized file. For commanding the nodes, it is very bad practice to mix command and status files in ZK.

          I am a Zookeeper brother, btw. (PMC member and all that)

          Jamie Johnson added a comment -

          I don't follow this at all. Why would servers need to know about other servers?

          The searcher servers would need to be aware of the other searcher servers wouldn't they? Assuming that a query could come into any one of those servers, they would all need to be aware of any other server which was acting as a searcher, right?

          This wouldn't be quite n^2 in any case. It could be order n for each change, but the constant factor would be quite small for reasonable clusters. The cost would be n^2/2 as the cluster came up, but even for 1000 nodes, this would clear in less than a minute which is much less time than it would take to actually bring the servers up.

          Right, each change would get replicated to the n servers watching that node. The total number of watchers is 1 for each node for each server, so n^2 total watchers. As you say though the number of watch events triggered should be reasonable as a stable cluster will not have a significant amount of shards/slices going up and down during operation. This seems to be an argument for not going with the summarized approach. The amount of information pulled in this case would be much less compared to a summarized file as each searcher would need to pull the entire state instead of just 1 change.

          I still don't think that there would be any need for all servers to know about this. The clients need to know, but not the servers.

          If you mean that a searcher serves as a query proxy between the clients and the servers, then you would require a notification to each search for each change. If you have k searchers and n nodes, bringing up the cluster would require k n / 2 notifications. For 100 proxies and 1000 search nodes, this is a few seconds of ZK work. Again, this is much less than the time required to bring up so many nodes.

          You're right that all servers don't need to be aware of the cloud state, but the servers which are acting as searchers do (are there any other roles that also need to know?). I don't see any difference between them and any other client, since they'll need to know about the other searchers to distribute the query.

          If you put server status into a single file, far fewer notifications would actually be sent because as the notifications are delayed, the watchers get delayed in being reset so you have natural quenching while still staying very nearly up to date.

          I see your point; to do something like this though, we'd need to have some leader which was responsible for maintaining this state. Wouldn't this still result in a large amount of data being pulled by the searchers for a very small change (i.e. 1 additional searcher being added)? I am guessing that you're of the opinion that it would be negligible, since the number of changes should be small so it would not pull that information very frequently (maybe just on startup)? I'm not sure if that would be an issue or not, so I'm open for input. The loggly guys ran into an issue with pulling the entire state, but that may have been because the current implementation on trunk polls the state every 5 seconds for updates instead of using watchers.

          Regarding putting the information about the collections available into the live nodes, I think that would be inefficient for the clients compared to putting it into a summarized file. For commanding the nodes, it is very bad practice to mix command and status files in ZK.

          Understood, I'll have to look at the leader task to see where it is at because that would be a blocker to this if we were to attempt to do some summarized file.

          I am a Zookeeper brother, btw. (PMC member and all that)

          Good to know, I am just a wanderer in this world so never sure who worships where

          Mark Miller added a comment -

          Away for the weekend, and a lot to comment on, but will have to wait.

          FYI though:

          current implementation on trunk polls the state every 5 seconds for updates instead of using watchers.

          I don't think so? It's been some time since I did this, but it should be using a watcher - the 5 seconds is just how long it should wait before actually grabbing the state - the idea being if node after node comes up and triggers the watch, if there is already a state update scheduled (5 seconds after the watch is triggered by default), a new update is not asked for - eg if nodes are firing up by the hundreds, the state is still only updated once every 5 seconds. It's still triggered by a watch though - I believe a watch on the shards node itself - which a core will ping with a set null data after registering in zk. Before adding the incremental update patch, the idea was that you could change a sub property on a node, and still trigger a reload of the state by triggering the watches on the shards node.
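          A small sketch of that coalescing behaviour, with a placeholder 5 second delay and a placeholder updateCloudState hook; this is illustrative, not the actual ZkController code:

          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.atomic.AtomicBoolean;

          // Sketch: many watch firings within the window collapse into a single
          // scheduled read of the cloud state.
          public class DebouncedStateUpdater {
            private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
            private final AtomicBoolean updateScheduled = new AtomicBoolean(false);
            private final Runnable updateCloudState; // the expensive state read

            public DebouncedStateUpdater(Runnable updateCloudState) {
              this.updateCloudState = updateCloudState;
            }

            // Called from the ZooKeeper watch callback.
            public void onWatchTriggered() {
              if (updateScheduled.compareAndSet(false, true)) {
                scheduler.schedule(new Runnable() {
                  public void run() {
                    updateScheduled.set(false);
                    updateCloudState.run();
                  }
                }, 5, TimeUnit.SECONDS);
              }
            }
          }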

          Jamie Johnson added a comment -

          Ah yes, you are right; sorry for not paying closer attention to this. The original patch removed that logic... I think for changes made to individual shards it is not needed, but as you stated, when the cluster is coming up this may be more of an issue. I've modified the patch to work in a similar fashion to how it used to. I am unfortunately unable to test this at the moment and won't be able to do so until early next week.

          If this is worth having let me know, otherwise I'll revert this and delete the second patch.

          Mark Miller added a comment -

          to do something like this though we'd need to have some leader which was responsible for maintaining this state. Wouldn't this still result in a large amount of data being pulled by the searchers for a very small change (i.e. 1 additional searcher being added).

          Or, assuming state changes are infrequent, you just use optimistic locking and retries? Or a state lock (though you probably want to avoid global locks if you can; I'm less worried about that for something that will generally be stable, I guess)? Even if the whole state was in an XML file, I'm guessing it would be plenty faster though ... I would guess the real issue is the request per node. I'd guess the description of a large cluster in one request would be much faster than reading a bit of data on 500 nodes.

          Ted Dunning added a comment -

          Or, assuming state changes are infrequent, you just use optimistic locking and retries?

          Optimistic update (not locking) works great in ZK. Just read, modify, and then try to write back against the same version. Repeat if the update fails.
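          A short sketch of that read-modify-write loop with the ZooKeeper Java client; the modify step is a placeholder:

          import org.apache.zookeeper.KeeperException;
          import org.apache.zookeeper.ZooKeeper;
          import org.apache.zookeeper.data.Stat;

          // Sketch: read the znode and its version, apply the change, and write back
          // conditioned on that version. Retry if another writer got there first.
          public class OptimisticUpdate {
            public interface Modifier {
              byte[] modify(byte[] current);
            }

            public static void update(ZooKeeper zk, String path, Modifier modifier)
                throws Exception {
              while (true) {
                Stat stat = new Stat();
                byte[] current = zk.getData(path, false, stat);
                byte[] proposed = modifier.modify(current);
                try {
                  zk.setData(path, proposed, stat.getVersion()); // fails on a version mismatch
                  return;
                } catch (KeeperException.BadVersionException e) {
                  // someone else bumped the version; re-read and try again
                }
              }
            }
          }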

          Even if the whole state was in an XML file, I'm guessing it would be plenty faster though ... I would guess the real issue is the request per node. I'd guess the description of a large cluster in one request would be much faster than reading a bit of data on 500 nodes.

          On a local network, ZK requests take considerably less than a millisecond. For more distributed networks, they are very near the round-trip time (20 ms ... 1000 ms generally). Reading one file of 100K makes a difference of 100 ms or less on most modern networks. Reading many files could be a real problem. Getting a list of the znodes in a single directory should be almost as fast as getting the contents of a single znode of the same size.

          The only viable alternative really is to use the directory listing as the state. That has some bad consequences. For instance, you may not want to use the host name as the only identifier of a host. You might want to include several hostnames. You might want to include several IP addresses. You might want to have a hostname that includes slightly dangerous bytes (you're nuts if you do, though). To do this with a directory listing is much trickier because you usually need a dictionary off to the side to help you find all the alternative data while still maintaining safe zknode names. If you have the dictionary, that dictionary might as well be the state file anyway and avoid the entire name/data problem.

          Ted Dunning added a comment - edited

          the idea being if node after node comes up and triggers the watch, if there is already a state update scheduled (5 seconds after the watch is triggered by default), a new update is not asked for - eg if nodes are firing up by the hundreds, the state is still only updated once every 5 seconds.

          Something like this is a great approach to metering the number of watch notifications that need to be sent by the ZK server. I would prefer to change the strategy slightly so that an update is scheduled immediately if no update has happened in the last 5 seconds, but is not scheduled more often than about every 5 seconds. The update can then unconditionally set the requisite watches.

          This means that you get immediate response to the occasional change, but never require more than n/5 updates per second where n is the number of clients. You can even make the delay be something like n/log_2 ( n )/20 for n >= 2 where n is the number of clients. This gives fast response for small clusters with notification rates that grow slowly for large clusters. At 1000 clients, you get 200 notifications per second worst case and at 8000 clients you get 500 notifications per second. In return for this very moderate transaction growth rate, it takes longer to register that a new cluster has appeared. If many nodes register in rapid succession, it takes 16 seconds for 8000 clients to all get the news, 5 seconds for 1000 clients and just less than a second for 100 clients.

          My guess is that 100 clients is a very large number.

          If counting clients is a pain, the constant value of 5 seconds delay is pretty good.

          Jamie Johnson added a comment -

          So if I am understanding this we still have the live_nodes but we'd remove the information under collections in favor of a single node which stores the cloud state in some XML format, correct?

          The process would then be that in ZkController.register the new Solr instance gets the current state, attempts to add its information to the file, and does an update against that particular version; if the update fails (presumably because another Solr instance has modified the node, resulting in a version bump) we'd simply repeat this.

          Would something like this work for the XML format? We could then iterate through this and update the CloudState locally whenever a change to this has happened.

          <cloudstate>
            <collectionstate name="collection1">
              <slice name="slice1">
                <shard name="shard1">
                  <props>
                    <str name="url">http://.…</str>
                    <str name="node_name">node</str>
                    <str name="roles">indexer,searcher</str>
                  </props>
                </shard>
              </slice>
            </collectionstate>
          </cloudstate>

          If this seems reasonable I will attempt to make this change.
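          For illustration, a sketch of walking such a document with the standard DOM API; the element and attribute names follow the example above, and this is not the actual CloudState code:

          import java.io.ByteArrayInputStream;

          import javax.xml.parsers.DocumentBuilderFactory;

          import org.w3c.dom.Document;
          import org.w3c.dom.Element;
          import org.w3c.dom.NodeList;

          // Walks the proposed <cloudstate> document and prints each shard's properties.
          // A real implementation would build CloudState/slice objects instead.
          public class CloudStateXmlReader {
            public static void read(byte[] xml) throws Exception {
              Document doc = DocumentBuilderFactory.newInstance()
                  .newDocumentBuilder().parse(new ByteArrayInputStream(xml));
              NodeList collections = doc.getElementsByTagName("collectionstate");
              for (int c = 0; c < collections.getLength(); c++) {
                Element collection = (Element) collections.item(c);
                NodeList shards = collection.getElementsByTagName("shard");
                for (int s = 0; s < shards.getLength(); s++) {
                  Element shard = (Element) shards.item(s);
                  StringBuilder line = new StringBuilder();
                  line.append(collection.getAttribute("name"))
                      .append("/").append(shard.getAttribute("name")).append(":");
                  NodeList props = shard.getElementsByTagName("str");
                  for (int p = 0; p < props.getLength(); p++) {
                    Element prop = (Element) props.item(p);
                    line.append(" ").append(prop.getAttribute("name"))
                        .append("=").append(prop.getTextContent());
                  }
                  System.out.println(line);
                }
              }
            }
          }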

          Jamie Johnson added a comment -

          The patch includes shard-roles.patch and scheduled_executors.patch with some additional fixes (the project now builds, which it did not before; added a check to see if the zknode for a shard has data before trying to use it; added a delay for the shard node and slice node).

          Ted Dunning added a comment -

          So if I am understanding this we still have the live_nodes but we'd remove the information under collections in favor of a single node which stores the cloud state in some XML format, correct?

          I don't quite think so. The problem with this is three-fold.

          A) It is very useful to have each ZK structure contain information that flows generally in one direction so that the updates to the information are relatively easy to understand.

          B) It is important to have at least a directory containing ephemeral files to describe which nodes are currently live.

          C) It is nice to separate information bound for different destinations so that the number of notifications is minimized.

          With these considerations in mind, I think it is better to have three basic structures:

          1) A directory of per node assignments to be written by the overseer but watched and read by each node separately. By (A), this should only be written by the overseer and only read by the nodes. By (C) the assignment for each node should be in a separate file in ZK.

          2) A directory of per node status files to be created and updated by the nodes, but watched and read by the overseer. By (A), the nodes write and the overseer reads. By (B) and (C), there is an ephemeral file per node. The files should be named with a key that can be used to find the corresponding status clause in the cluster status file. The contents of the files should probably be the status clause itself if we expect the overseer to update the cluster file or can be empty if the nodes update the status file.

          3) A composite cluster status file created and maintained by the overseer, but possibly also updated by nodes. The overseer will have to update this when nodes disappear, but it would avoid data duplication to have the nodes directly write their status to the cluster status file in slight violation of (A), but we can pretend that only the nodes update the file and the clients read and watch it with the overseer only stepping in to do updates that crashing nodes should do. Neither (B) nor (C) since this information is bound for all clients.

          Note that no locks are necessary. Updates to (1) are only by the overseer. Updates to (2) are only by the nodes and no two nodes will touch the same files. Updates to (3) are either by the nodes directly with a versioned update or by the overseer when it sees changes to (2). The former strategy is probably better.

          The process would then be that in ZkController.register the new Solr instance gets the current state, attempts to add its information to the file, and does an update against that particular version; if the update fails (presumably because another Solr instance has modified the node, resulting in a version bump) we'd simply repeat this.

          Yes.

          But the new nodes would also need to create their corresponding ephemeral file.

          Would something like this work for the XML format? We could then iterate through this and update the CloudState locally whenever a change to this has happened.

          <cloudstate>
            <collectionstate name="collection1">
              <slice name="slice1">
                <shard name="shard1">
                  <props>
                    <str name="url">http://.…</str>
                    <str name="node_name">node</str>
                    <str name="roles">indexer,searcher</str>
                  </props>
                </shard>
              </slice>
            </collectionstate>
          </cloudstate>
          

          I think that this is good in principle, but there are few nits that I would raise.

          First, common usage outside Solr equates slice and shard. As such, to avoid confusion by outsiders, I would recommend that shard here be replaced by the term "replica" since that seems to be understood inside the Solr community to have the same meaning as outside.

          Secondly, is the "props" element necessary? It seems excessive to me.

          Thirdly, shouldn't there be a list of IP addresses that might be used to reach the node? I think that it is good practice to have multiple addresses for the common case where a single address might not work. This happens, for instance, with EC2 where each node has internal and external addresses that have different costs and accessibility. In other cases, there might be multiple paths to a node and it might be nice to use all of them for failure tolerance. It is not acceptable to list the node more than once with different IP addresses because that makes it look like more than one node which can cause problems with load balancing.

          Fourthly, I think it should be clarified that the node_name should be used as the key in the live_nodes directory for the ephemeral file corresponding to the node.

          If this seems reasonable I will attempt to make this change.

          Pretty close to reasonable in my book.

          Mark Miller added a comment -

          The overseer will have to update this when nodes disappear

We don't actually currently remove info about a shard when it goes away - this allows Solr to know that a part of the index is missing if all replicas in a slice go down. Eventually we will offer partial results with a warning in this case - currently we return an error. Instead, clients reconcile the cluster layout with the live nodes listing (ephemerals) to see what is up. It's expected that every shard that comes up and registers should at some point come back if it goes down - unless it's 'manually' removed from zk.
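
For illustration only, a minimal sketch of that reconciliation - treating a replica from the layout as usable only if its node_name also appears under /live_nodes - might look like this with the ZooKeeper Java client (the Replica holder class here is a hypothetical stand-in, not the actual Solr code):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch: a replica listed in the cluster layout is only considered
// usable if its node_name also appears as an ephemeral child of /live_nodes.
public class LiveReplicaFilter {

  public static Set<String> usableReplicaUrls(ZooKeeper zk, List<Replica> layoutReplicas)
      throws Exception {
    Set<String> live = new HashSet<String>(zk.getChildren("/live_nodes", false));
    Set<String> urls = new HashSet<String>();
    for (Replica r : layoutReplicas) {
      if (live.contains(r.nodeName)) {   // node is up: safe to send requests to it
        urls.add(r.url);
      }
    }
    return urls;
  }

  // Minimal stand-in for whatever the layout exposes per replica (not Solr's class).
  public static class Replica {
    final String nodeName;
    final String url;
    public Replica(String nodeName, String url) { this.nodeName = nodeName; this.url = url; }
  }
}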

          Ted Dunning added a comment - - edited

          We don't actually currently remove info about a shard when it goes away

          That is fine.

          But there is a strong distinction between different information. I see three kinds of information about a shard that live in different places and are intended for different audiences.

          • the content and status of collections. This is intended as control input for the overseer and contains information about which collections exist and what partitions they contain and other policy information such as requested replication level. This is the information that should not be deleted
          • the shard assignments for nodes. This is generated by the overseer and intended for the nodes. When a node goes down, the assignments for that node should be deleted and other nodes should get other assignments.
          • the cluster state. This is generated by the nodes and the overseer and read by the search clients to let them know who to send queries to. This is the file that I was referring to. I was suggesting that the clauses that state that a particular node is serving requests for a particular shard should be deleted by the overseer when the node disappears. I can't imagine that there is any question about this because the clients have an urgent need to stop sending queries to downed nodes as soon as possible. Deleting a clause from this state doesn't forget about the shard; it just gives an accurate picture of who is serving a shard.

          I really think that separating the information like this makes it much simpler to keep track of what is going on, especially for the clients. They should only need to look one place for what they need to know. Besides, they can't really reconcile things correctly. If shards A,B,C were all served by node 3, then when node 3 goes down, the client can get things right. But if node 3 comes back and reloads A but not yet B and C, the client cannot know what the current state is unless somebody updates the state correctly.

          In my view, what should happen when node 3 goes down and then back up is this:

          • the ephemeral for node 3 in live_nodes disappears
          • the overseer removes the "node 3 has A, B, C" clause from the cluster state
          • the overseer assigns shards A, B and C to other nodes to start the replication process
          • node 6 registers that it is serving shard C
          • node 5 registers that it is serving shard B
          • node 3 comes back, creates the ephemeral in live_nodes and starts updating the indexes it has from the transaction log.
          • node 3 registers that it has shards A and B up-to-date because there have been no changes.
          • the overseer looks at the cluster state and decides that node 3 is under-utilized and shard B is over replicated. It unassigns shard B from node 5 and assigns shard D to node 3.
          • node 5 updates the cluster state to indicate it is no longer serving shard B
          • node 3 downloads a copy of shard D and replays the log for D to catch up the archived index.
          • node 3 registers that it is serving D

          and so on.

          The sequence of states for node 3 in this scenario is

          • serving A,B,C
          • down (not serving anything)
          • up (not serving anything)
          • serving A and B
          • serving A, B and D

          The client cannot derive this information accurately from simple liveness information. Therefore the node has to update the cluster state.
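
As a rough sketch of the "node comes back" step in this scenario - recreating the ephemeral under /live_nodes so the rest of the cluster sees the node again - using the ZooKeeper Java client (paths and names are illustrative assumptions, not the actual implementation):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch of a node announcing itself after a restart.
public class LiveNodeRegistration {

  public static void register(ZooKeeper zk, String nodeName)
      throws KeeperException, InterruptedException {
    String path = "/live_nodes/" + nodeName;
    try {
      // Ephemeral znode: removed automatically when this node's session expires,
      // which is exactly the "ephemeral for node 3 disappears" step above.
      zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    } catch (KeeperException.NodeExistsException e) {
      // An ephemeral from the previous session can linger until the old session
      // times out; a caller could retry after the session timeout.
    }
    // Advertising which shards this node is serving would follow as a separate,
    // versioned write to the shared cluster state (see later comments).
  }
}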

          Jamie Johnson added a comment -

          phew, don't pay attention for a few hours and lots to read....

          So I just need to summarize what needs to be done at this point.

          So we have
          /live_nodes which does not change
          /collections which does not change (although I am not completely sure what we'll use it for now)
          /cloudstate which contains the current cluster state.

/cloudstate is updated by the individual nodes optimistically; if an update fails, we keep retrying until it succeeds against the current version.
          /live_nodes is the same as it is now.
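
A minimal sketch of that optimistic update loop, assuming the ZooKeeper Java client and a hypothetical callback that produces the new serialized state (none of these names are taken from the actual patch):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Hypothetical sketch: read /cloudstate, apply a local change, and write it back
// conditioned on the version we read. On a version conflict, re-read and retry.
public class CloudStateUpdater {

  public interface Mutation {
    byte[] apply(byte[] currentState); // returns the new serialized cluster state
  }

  public static void updateOptimistically(ZooKeeper zk, Mutation mutation)
      throws KeeperException, InterruptedException {
    final String path = "/cloudstate";
    while (true) {
      Stat stat = new Stat();
      byte[] current = zk.getData(path, false, stat);
      byte[] updated = mutation.apply(current);
      try {
        // setData succeeds only if nobody else has written since our read.
        zk.setData(path, updated, stat.getVersion());
        return;
      } catch (KeeperException.BadVersionException e) {
        // Another node updated the state first; loop and retry on the new version.
      }
    }
  }
}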

          So in my mind I am not sure why we need the /collections instance anymore. If we maintain the state of the cluster in /cloudstate but don't remove the nodes that have gone down (as Mark mentioned) and just update their status, it seems like we should be able to do what we want. Now admittedly I have not gone through Ted's comment in detail so perhaps there is a nugget in there that I am missing.

I have the basics of this running locally except the status of each replica in the cloudstate. I am assuming that I can leverage the leader logic that currently exists to add another watcher on live_nodes to update the state when an instance goes down; otherwise the replica itself is responsible for updating its state (I believe that was what was said).

          <?xml version="1.0" encoding="UTF-8" ?>
          <cloudstate>
          	<collectionstate name="testcollection">
          		<shard name="shard0">
          			<replica name="solrtest.local:7574_solr_">
          				<str name="roles">searcher</str>
          				<str name="node_name">solrtest.local:7574_solr</str>
          				<str name="url">http://solrtest.local:7574/solr/</str>
          				<status>DISABLED</status>
          			</replica>
          			<replica name="solrtest.local:7575_solr_">
          				<str name="roles">searcher</str>
          				<str name="node_name">solrtest.local:7575_solr</str>
          				<str name="url">http://solrtest.local:7575/solr/</str>
          				<status>DEAD</status>
          			</replica>
          			<replica name="solrtest.local:8983_solr_">
          				<str name="roles">indexer,searcher</str>
          				<str name="node_name">solrtest.local:8983_solr</str>
          				<str name="url">http://solrtest.local:8983/solr/</str>
          				<status>ALIVE</status>
          			</replica>
          		</shard>
          	</collectionstate>
          </cloudstate>
          

          The status is obviously notional since I'm not really populating it now, but I'm thinking ALIVE, DEAD, DISABLED, any others?
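
Purely as a strawman, the notional statuses above could be captured in a small enum (hypothetical, not part of the patch); a recovery-related state could be added alongside these later:

// Hypothetical enum mirroring the notional statuses in the XML above.
public enum ReplicaStatus {
  ALIVE,    // registered and serving requests
  DEAD,     // the node's ephemeral in /live_nodes is gone
  DISABLED  // up, but intentionally taken out of rotation
}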

          I'll try to put together a patch tonight which captures this (I'll see if I can add the logic for the leader as well to track replicas that die).

          Ted Dunning added a comment -

          So in my mind I am not sure why we need the /collections instance anymore. If we maintain the state of the cluster in /cloudstate but don't remove the nodes that have gone down (as Mark mentioned) and just update their status, it seems like we should be able to do what we want. Now admittedly I have not gone through Ted's comment in detail so perhaps there is a nugget in there that I am missing.

          In a sound byte, collections is aspirational and cloudstate is actual. Collections specifies what we want to be true and cloudstate tells us what is true at this moment.

          Ted Dunning added a comment -

          So in my mind I am not sure why we need the /collections instance anymore. If we maintain the state of the cluster in /cloudstate but don't remove the nodes that have gone down (as Mark mentioned) and just update their status, it seems like we should be able to do what we want

          As I mentioned in the (too long) comment, these have to be removed in order to get an accurate view of current state. If we really want to have a slightly historical view, marking replicas as defunct might do, but as things move around you will just get a muddle. I think it is better to leave cloudstate with state as it is now and collections as state as it should be. For history, go to the logs.

          Btw... in English, we are consistently saying "cluster state", but the file is called "cloudstate". There seems a contradiction here that is easily remedied.

          Mark Miller added a comment -

          The current method of dealing with downed nodes is not so bad - the cluster layout is compared with the live_nodes - this gives searchers the ability to know a node is down within the ephemeral timeout. Before that happens (a brief window), failed requests are simply retried on another replica. The searcher locally marks that the server is bad, and then periodically tries it again - unless the ephemeral goes down and it is no longer consulted.

          The client cannot derive this information accurately from simple liveness information.

It's simply not supported that way currently - this is intentional though. If you want to change which shards a node is responsible for serving, you don't just bring it back up with fewer or different shards - you first delete the node info from the cluster layout, then you bring it up. We didn't mind at the time that a variety of advanced scenarios required manual editing of the zk layout. We have intended to move towards a separate model and state layout eventually though (see the solrcloud wiki page). That is essentially in line with the proposed path, I think.

I'll admit I'm biased, but I lean against an overseer almost more than against optimistic collection locks - though I have not had time to fully digest the latest proposed changes. I suppose that when you have a solid leader election process available, an overseer is fairly cheap, and if used for the right things, fairly simple. When we get into rebalancing (we don't plan to right away), I suppose we come back to it anyhow.

          marking replicas as defunct might do,

          Yeah, I think this gets complicated to do well in general. I like simple solutions like the one above. And I think good monitoring is a perfectly acceptable requirement for a very large cluster.

It's good stuff to consider. Exploring all of these changes should likely be spun off into another issue though. Advancements in how we handle all of this are a much larger issue than Shard/Node states.

          Mark Miller added a comment -

Also, Jamie, FYI: this larger-picture work will probably go best if done on the solrcloud branch, as there are already a variety of changes and additions (like leader election / locking). If done against trunk, there is going to be a mess of dupe work and conflicts.

          Mark Miller added a comment -

          Btw... in English, we are consistently saying "cluster state", but the file is called "cloudstate". There seems a contradiction here that is easily remedied.

          +1

          Jamie Johnson added a comment -

          Ok, so here is what I currently have. The patch named cluster_state-file.patch creates a /clusterstate zk node with the following

          <?xml version="1.0" encoding="UTF-8" ?>
          <clusterstate>
          	<collectionstate name="testcollection">
          		<shard name="shard0">
          			<replica name="testhost:7574_solr_">
          				<str name="roles">searcher</str>
          				<str name="node_name">testhost:7574_solr</str>
          				<str name="url">http://testhost:7574/solr/</str>
          			</replica>
          			<replica name="testhost:8983_solr_">
          				<str name="roles">indexer,searcher</str>
          				<str name="node_name">testhost:8983_solr</str>
          				<str name="url">http://testhost:8983/solr/</str>
          			</replica>
          		</shard>
          	</collectionstate>
          </clusterstate>
          

          (Again state currently not implemented). This works like the original implementation in that you must verify the host is alive in live_nodes. I've added the logic back in (I'm not using it for anything though) to create the

          /collections/testcollection/shards/shard0/testhost:7574_solr_
          /collections/testcollection/shards/shard0/testhost:8983_solr_

This theoretically allows us to do what Ted suggested earlier, namely that actual state is in /clusterstate and aspirational state is in /collections, but it requires that there be some cluster leader to manage this properly. I had this removed, but added it back for now. If we don't need it I'll nix it in another patch.

          Jamie Johnson added a comment -

I'm not sure if this is due to something that I modified or not, but load-balanced queries and distributed queries seem to be broken in this patch. Specifically, querying across a 2-shard cluster, each shard with 50 results, gave me a count of 150 (obviously not right), and querying across a 1-shard cluster with a replica always hits the same server. I'll have to dig into why this is happening tomorrow, unless someone else has an idea what could be causing this?

          Mark Miller added a comment -

          I've spun off:

          SOLR-2821 Improve how cluster state is managed in ZooKeeper.
          SOLR-2820 Add both model and state to ZooKeeper layout for SolrCloud.

          Even if we hit them with one patch, makes it easier to track these changes.

          Mark Miller added a comment -

Thanks for the patch, Jamie! I'll be able to take a closer look at it later today.

          Jamie Johnson added a comment -

I'm preparing a new patch tonight. I believe I've narrowed down what is causing the issues I mentioned before. Also, I am going back to ClusterState and Slice being immutable instead of locking.

          Jamie Johnson added a comment -

          The latest patch fixes the issue I had mentioned before, where live_nodes wasn't properly getting updated. This implementation also keeps ClusterState (used to be CloudState) and Slice immutable.

          The following is a list of things that can still be done:

          1. Consider removing ZkStateReader.updateCloudState methods, these aren't called by anything other than tests right now. On a side note, the current implementation processes every watch event instead of having the 5s delay. This could cause some issues performance wise when the cluster is first coming up since everyone will be trying to write to the cluster state. If we have to add that back it should be a really simple change.
          2. Update the tests to read from /cluster_state instead of /collections
          3. Decide if ClusterState should be the ideal state or actual (i.e. do we maintain information in /collections as ideal and update ClusterState to track nodes going down). Currently ClusterState is ideal. This would require some leader to track when a node is no longer live, so should probably be pushed off.

          Any thoughts on these? Specifically 1.

          Ted Dunning added a comment -

          Consider removing ZkStateReader.updateCloudState methods, these aren't called by anything other than tests right now. On a side note, the current implementation processes every watch event instead of having the 5s delay. This could cause some issues performance wise when the cluster is first coming up since everyone will be trying to write to the cluster state. If we have to add that back it should be a really simple change.

          Are these truly defunct? Or just not used yet?

          If the latter, I would keep them.

          I don't think that handling all watches is a problem at this point given that >1000 node search clusters probably don't exist.
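
If the delay ever does need to come back, one lightweight option is to coalesce bursts of watch events into a single delayed refresh rather than re-reading the state on every event. A hypothetical sketch (not taken from any of the attached patches):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: schedule one refresh a short delay in the future and cancel
// any refresh still pending, so a burst of watch events results in a single read.
public class CoalescingStateRefresher {

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final AtomicReference<ScheduledFuture<?>> pending =
      new AtomicReference<ScheduledFuture<?>>();
  private final Runnable refresh;  // e.g. re-read /cloudstate from ZooKeeper
  private final long delayMs;

  public CoalescingStateRefresher(Runnable refresh, long delayMs) {
    this.refresh = refresh;
    this.delayMs = delayMs;
  }

  // Called from the ZooKeeper watcher on every relevant event.
  public void onWatchEvent() {
    ScheduledFuture<?> previous =
        pending.getAndSet(scheduler.schedule(refresh, delayMs, TimeUnit.MILLISECONDS));
    if (previous != null) {
      previous.cancel(false);  // the newer scheduled refresh supersedes it
    }
  }
}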

          Jamie Johnson added a comment -

          ZkStateReader.updateCloudState could be used if we wanted to add the delay back. Otherwise it is not called.

One thing we could do to improve the issue mentioned above is to have replicas that are registering with the cloud check the current information in the ClusterState before storing it to ZK. This would mean that we only update on changes. I can look at doing that as well.

          Mark Miller added a comment -

          Hey jamie - trying to apply the latest patch, but looks like it already expects to find ClusterState? It's removing CloudState, but then trying to patch ClusterState it looks like. Does the latest patch apply cleanly for you?

          Mark Miller added a comment -

          I just hacked in the ClusterState.java class from the patch for now - seems to work from there.

          Mark Miller added a comment -

Since we probably want to do an svn rename to preserve history, it's actually likely easier to leave CloudState in this patch, and do the rename to ClusterState in a different issue. Then a committer can do the proper refactoring - I can do it - I'll also rename all of the local vars from cloudState to clusterState.

          Jamie Johnson added a comment -

Sorry for not responding faster. I did this pretty late last night, so I didn't verify that it applied cleanly; I just assumed it did, which is obviously the issue here. Sorry for not having a bit more QA on that.

          Mark Miller added a comment -

          No worries, just alerting for the sake of anyone else applying the patch.

Playing with the patch now - certainly, cluster state reads are dramatically faster now - that's great. It seems like it will be a little more annoying for a user to make changes on the fly - but I'm sure the tradeoff should be worth it.

          I do think I get a test fail or two still, but overall, it's looking good.

          Jamie Johnson added a comment -

          I'll look into the test failures, as well. May be something simple to fix.

          Jamie Johnson added a comment -

Yeah, I see at least one issue: basically, CloudStateUpdateTest manually updates the information in ZK, which I'm not reading anymore. Should be easy to fix though.

          Jamie Johnson added a comment -

So doing these incremental updates is a bit messier now with regard to testing. Where I could previously delete a shard or update data on a shard directly, I can't do this anymore without rewriting the entire cluster state. Should I just create some strings which contain the cluster states for each step, or would it be better to keep rebuilding the ClusterState object and trying to store that?

          Mark Miller added a comment -

One thing we probably want to consider is the limitations we might hit as this XML file grows. E.g., at some point, is grabbing an 800k file on every update going to be a lot worse than incremental updates against zknodes? Are we going to be okay with a max limit of nodes in the low 1000's? (ZK max file size is 1 MB - is that overridable without hacking a version?). We will likely add more metadata to each node over time.

Also, should we consider JSON or YAML over XML? Solr has historically used XML for config, but I don't want that to unduly influence the right choice here. Solr also has JSON support (which I would like as a better default for talking to Solr myself).

          Jamie Johnson added a comment - - edited

Definitely something worth considering. Moving to JSON reduces the issue, but does not fix it. My assumption is that the number of updates will be relatively low and infrequent. Switching a node's roles may occur a little more frequently, but still shouldn't happen on a regular basis. My 1 shard with 2 replicas is about 600 bytes; in JSON format it is about 300.

          Switching to storing this in JSON should be fairly straight forward to do, it's really just a matter of updating the load and store methods in ClusterState.
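
For comparison, a JSON rendering of the earlier two-replica example might look roughly like this (purely illustrative; the exact key layout would be decided when the load/store methods change):

{
  "testcollection": {
    "shard0": {
      "testhost:7574_solr_": {
        "roles": "searcher",
        "node_name": "testhost:7574_solr",
        "url": "http://testhost:7574/solr/"
      },
      "testhost:8983_solr_": {
        "roles": "indexer,searcher",
        "node_name": "testhost:8983_solr",
        "url": "http://testhost:8983/solr/"
      }
    }
  }
}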

          Ted,

          Do you have any thoughts on the size of this file?

          Ted Dunning added a comment -

          Do you have any thoughts on the size of this file?

Well, the prototype that I have half coded uses JSON if only because gson makes that choice easy, but the last time that I said that big JSON is no problem was in Mahout. After boldly making that statement, I started trying to store models that wound up as 100MB of JSON which took a good GB of RAM to parse.

My current, revised position is that serialization fashions will change over time and they should always be pluggable. As such, let's stay with what is traditional for a bit longer. XML should not have any significant size penalty if we bother to GZIP the results, and uncompressed XML or JSON both win on readability, which is the critical feature during development.

          If size really is a problem, then protobufs or Avro is a better answer than JSON anyway.

          Jamie Johnson added a comment -

          Sounds good.

          Mark what additional things do we need before we can commit this to the branch?

          Mark Miller added a comment -

          I have to properly review it and I can get it in. Hopefully fairly soon.

          One thing I think I noticed this morning is that we still keep properties on the old shard nodes that are redundant and unused? (url, roles, etc)

          Jamie Johnson added a comment -

          Are you saying that we don't need those properties in /cluster_state or that we shouldn't be storing them in /collections anymore?

          I think it's the latter because I think we need the url and node name, but I could be wrong.

          Mark Miller added a comment -

          Right, I don't think we need them under /collections anymore? They will likely just get out of sync with the cluster_state - I think we only need that info in one spot.

          Can you do one thing? Back out the ClusterState rename from the patch? Then I can do the proper SVN moves for that class rename as a separate issue.

          Mark Miller added a comment -

          Actually, isn't the whole node for each replica redundant now? Seems we don't need it at all anymore? We can still use the leader election and lock paths but the following nodes should go?

             /solr/collections/collection1/shards/shard1/127.0.0.1:39243_solr_ (0)
                DATA:
                    roles=null
                    node_name=127.0.0.1:39243_solr
                    url=http://127.0.0.1:39243/solr/
          
          Jamie Johnson added a comment -

          My first patch actually removed them, but I had backed that out so we could support in the future one being the ideal state and one being the actual state. I'll back out the ClusterState name change and remove the information from the collections that we're not using.

          Jamie Johnson added a comment -

The latest patch changes back from ClusterState to CloudState and removes the information under /collections/collection1/shards...

          Mark Miller added a comment -

          latest patch giving me some trouble...doesn't seem to apply cleanly against ZkController, and also doesn't seem to compile because lots of refs to ClusterState left...also CloudState does not appear to be back? And there are a lot of classes in org.apache.solr.cloud.zookeeper - can we move those to org.apache.solr.common.cloud? If we want to partition them, org.apache.solr.common.cloud.watcher seems like it might be a better package name.

          Jamie Johnson added a comment -

          Apparently I'm not moving fast enough. The issue was that there were changes to trunk that I had not pulled. Hopefully this fixes the issue.

          Mark Miller added a comment -

Yeah, I figured that was the case with ZkController - but since I had the other compile errors (references to the missing ClusterState class) I didn't attempt to simply address the conflict in that class.

          Thanks for the new patch! I should be able to check it out in the morning.

          Yonik Seeley added a comment -

          Just catching up on these discussions - lots of great stuff guys!

          Also, should we consider JSON or YAML over XML? Solr has historically used XML for config, but I don't want that to unduly influence the right choice here. Solr also has JSON support (which I would like as better default for talking to Solr myself).

          I think so. There tends to be a knee-jerk reaction against XML from many developers these days, and it seems like anything new being developed tends to use JSON. It's not about size or parsing speed - it's about what people will like the best. We need to pick our battles, and this one isn't worth fighting IMO.

          Another random thing to think about when designing this "schema" is the restart scenario.
If a node comes back up, how does it know what shards it was previously serving? It either needs to be kept locally, or needs to be kept in zookeeper. It sounds like we're leaning toward keeping that info locally and then the node will advertise that when it comes back up (i.e. I have replica X in the "recovering" state). Same for other per-node info like custom roles they were labeled with ("indexer", "searcher", etc)? It feels like we want some persistent node info around in ZK - it would certainly make monitoring and some other things easier - but then one must worry about garbage collection (nodes that you never see again).

Sounds like the overseer should be the one deciding if a node that comes back up should try and recover its former replica(s)
          or if it's been long enough (and another replica was created) that it should be assigned to a new task.

          Fun stuff! Unfortunately I need to go work on my Lucene Eurocon proposal now...

          Jamie Johnson added a comment -

          Thanks for the thoughts Yonik. Changing the serialization is a very trivial thing, once this patch is on the branch I can switch it pretty easily, all of the logic for doing this is currently in the CloudState object, so very simple to do.

          Mark Miller added a comment -

          I'm looking into a couple test failures this morning with the latest patch.

          Mark Miller added a comment -

          Okay, I've done a bit of cleanup and I am committing - we can iterate from there.

          Mark Miller added a comment -

          Thanks!


            People

            • Assignee:
              Mark Miller
              Reporter:
              Yonik Seeley
            • Votes:
0
Watchers:
6
