Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      clusterstate.json is a single point of contention for all components in SolrCloud. It would be hard to scale SolrCloud beyond a few thousand nodes because there are too many updates and too many nodes that need to be notified of the changes. As the number of nodes goes up, the size of clusterstate.json keeps growing, and it will soon exceed the limit imposed by ZK.

      The first step is to store the shard information in separate ZK nodes so that each node can just listen to the shard node it belongs to. We may also need to split each collection into its own ZK node, with clusterstate.json holding just the names of the collections.
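
      For illustration, a hypothetical layout along those lines (the node names here are only indicative, not a committed design) could look like this:

          /clusterstate.json            <- only the list of collection names
          /collections
              /collection1
                  /state.json           <- shards and replicas of collection1 only
              /collection2
                  /state.json

      Each node would then watch only the state of the collections/shards it actually hosts.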

      This is an umbrella issue

        Activity

        Noble Paul created issue -
        Mark Miller added a comment -

        I think this is just one of many issues you will hit at a few thousand nodes currently.

        Mark Miller added a comment -

        size of clusterstate.json keeps going up and it will soon exceed the limit imposed by ZK.

        That limit is adjustable - I think even at a couple thousand nodes you are only talking a couple/few MB at most, which moves pretty quick over a fast network.

        I'm not saying we shouldn't look at this, but my testing of this at 1000 nodes was pretty smooth, so I would guess a couple thousand nodes is also reasonable - and to my knowledge there is no one approaching that scale with SolrCloud currently.

        Noble Paul added a comment - - edited

        [~hakeber] The requirement is to scale to 100,000's of nodes.

        Each STATE command would mean every node will need to download the entire clusterstate.json, and soon we will break ZK.
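
        As a rough back-of-the-envelope illustration (with made-up numbers): a 3 MB clusterstate.json watched by 10,000 nodes means roughly 10,000 x 3 MB = ~30 GB read from ZK for every state change, and in a cluster that size state changes happen almost continuously.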

        Mark Miller added a comment - - edited

        We have not even nailed 1000 nodes fully yet - seems silly to start working on 100,000's.

        Noble Paul added a comment -

        We need to eliminate the known bottlenecks if we need to scale. Is there any other obvious issue we need to address to scale beyond the current limit?

        Noble Paul added a comment -

        ZK documentation says 1mb is the recommended limit

        http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#Data+Access

        Mark Miller added a comment -

        ZK documentation says 1mb is the recommended limit

        That's because it's kept in RAM and they want to discourage bad patterns. 1mb has not scaled with networks and hardware though - it's arbitrary to say 1mb and not 3mb (which handles thousands of nodes). 3mb will perform just as well as 1mb. With modern server RAM and network speed, this stuff flies around easily - I saw that on my 1000 node test - the UI was the main bottleneck there - it takes a long time to render the cloud screen due to the rendering speed.

        We also are not constantly working with large files - in a steady state we don't pull or push large files at all to ZK - it's only on a cluster state change. All of this makes 1mb or 5mb pretty irrelevant for us - you can test it out and see.

        Mark Miller added a comment -

        I do agree that it becomes relevant once you are talking 10,000 or 100,000 nodes, etc. But like I said, we have not even proved out a couple thousand nodes, so it seems like we are getting ahead of ourselves if we are already focusing on 100,000 node issues.

        Jack Krupansky added a comment -

        It might be helpful to clarify the intended use cases. I mean, normally, traditionally, a "cluster" is some number of nodes that have an application need to talk with each other, such as a Solr query fanning out to shards and then aggregating the results.

        So, are we really talking about individual collections with many thousands of shards?

        Or, are we talking more about having many thousands of collections, each of which may only have a rather modest number of shards?

        And, are we talking about "multitenancy" here?

        Yago Riveiro added a comment -

        I hit the ZK limit of 1MB per node with more than 10K collections, each with 3 shards and replicationFactor=2.

        I found a workaround for this using the -Djute.maxbuffer parameter configured on both ZK and Solr, but ZK's documentation says that can be unstable.
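
        For example (hypothetical values; jute.maxbuffer is a ZooKeeper system property given in bytes and has to be set to the same value on the ZK servers and on the Solr JVMs):

            # on each ZooKeeper server
            -Djute.maxbuffer=10485760
            # and the same value on each Solr node
            -Djute.maxbuffer=10485760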

        I don't know whether having a clusterstate.json with so many collections can degrade performance, but it is too difficult to manage.

        If each collection had its own clusterstate.json, maybe migrating a collection to another cluster will be easier: you only need to copy the clusterstate and the core folders to the other cluster and it's done. You would have the problematic collection isolated with its own resources.

        Mark Miller added a comment -

        If each collection had its own clusterstate.json, maybe migrate collection to other cluster will be more easy,

        I think that is the low hanging first step - it's come up before and I think that is a reasonable idea even for smaller clusters.

        Mark Miller added a comment -

        The first step is to store the shards information in separate nodes and each node can just listen to the shard node it belongs to. We may also need to split each collection into its own node and the clusterstate.json just holding the names of the collections .

        We actually used to have a very fine-grained layout similar to this - it had some advantages, but it turned out to be very expensive because of having to do so many calls to load the state. I was also never very happy with the number of watchers that we were juggling. I think we want to find the right balance, or perhaps see if ZooKeeper has gained the ability to read multiple nodes in a single call.

        Noble Paul added a comment -

        We also are not constantly working with large files - in a steady state we dont pull or push large files at all to ZK - it's only on a cluster state change

        In a big enough cluster you can expect a state change event almost every few seconds. So, it is not ideal to update the state on every node all the time.

        If each collection had its own clusterstate.json, maybe migrate collection to other cluster will be more easy,

        Yes, it is a low-hanging fruit, and probably easier to implement than separating out shards.

        but it was very expensive it turned out, because of having to do so many calls to load the state

        This would be a very wrong approach. Each node does not need to be aware of every other node in the cluster. A node only needs to be aware of the shards it is a member of; it really does not have to load the state of other shards. The only time a node needs to know about the state of other shards is when it needs to forward a request. That information can be looked up on demand and cached. The cache can be invalidated when a request is fired at the wrong node. Each request would say which collection/shard/range it is for; if that assumption is wrong, the node would throw an appropriate exception, and the sender can invalidate the cache and refresh the state.
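
        A minimal sketch of that sender-side behaviour, in Java (all the types and method names below are hypothetical stand-ins, not existing Solr APIs):

            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            // Sketch of the on-demand lookup + cache-invalidation idea described above.
            public class ShardStateCache {

                interface ZkReader { ShardState readShardState(String collection, String shard); }
                interface ShardState { Replica leader(); }
                interface Replica { String send(String request) throws WrongNodeException; }
                static class WrongNodeException extends Exception {}

                private final Map<String, ShardState> cache = new ConcurrentHashMap<>();
                private final ZkReader zk;

                public ShardStateCache(ZkReader zk) { this.zk = zk; }

                public String forward(String collection, String shard, String request) throws WrongNodeException {
                    String key = collection + "/" + shard;
                    // look the shard state up on demand and cache it
                    ShardState state = cache.computeIfAbsent(key, k -> zk.readShardState(collection, shard));
                    try {
                        // the request itself says which collection/shard/range it is for,
                        // so the receiving node can validate the assumption
                        return state.leader().send(request);
                    } catch (WrongNodeException e) {
                        // wrong node: invalidate the cached entry, refresh from ZK, retry once
                        cache.remove(key);
                        ShardState fresh = zk.readShardState(collection, shard);
                        cache.put(key, fresh);
                        return fresh.leader().send(request);
                    }
                }
            }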

        As I see it, a SolrCloud cluster is a cluster of shards. A shard is the logical unit. Nobody should need to watch other shards on a real-time basis. In a very large cluster, requests would rarely span shards because the data would be partitioned in such a way that queries/updates are contained within a single shard.

        Mark Miller added a comment -

        In a big enough cluster you can expect a state change event almost every few seconds. So , it is not ideal to update the state on each node all the time

        I don't believe that is the case for a couple-thousand-node cluster. And while not ideal, the size of the file at a couple thousand nodes, combined with network speeds, keeps it from being any kind of bottleneck until you can scale well beyond what SolrCloud can do right now due to many other factors, I think.

        Noble Paul added a comment -

        Jack Krupansky

        It might be helpful to clarify the intended use cases

        Are we really talking about individual collections with many thousands of shards?

        Or, are we talking more about having many thousands of collections, each of which may only have a rather modest number of shards?

        Actually this is an umbrella issue to address both.

        I would probably attack the latter case first. I guess it would be easier than having 10,000s of shards.

        Mark Miller added a comment -

        The only instance when a node needs to know about the state of other shards is when it needs to forward a request.

        It would like to know the state of the other shards as it often has to query against all the shards in a collection and we don't want to keep hitting a shard that is down and we don't want to take forever to hit a shard that just came up ...

        That information can be looked up on demand and cached.

        I don't think I'm sold on that yet. Perhaps you need to do stuff like that for truly massive clusters, but until we prove everything out on a few thousand nodes, it doesn't seem worth this kind of change to me.

        Nobody should need to watch other shards on a realtime basis.

        I do think it's important that we continue to propagate cluster state changes relatively quickly after they happen.

        Noble Paul added a comment -

        The other bottleneck is the single work queue at the Overseer level. We should have a queue per collection and, subsequently, we can have a queue per shard. The Overseer can have a pool of threads to process the queues instead of the single thread it has now.
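
        A rough sketch of that shape in Java (hypothetical classes, not the actual Overseer code): each collection gets its own queue drained by its own worker; a bounded pool, as suggested further down in this thread, could be substituted for the cached pool.

            import java.util.Map;
            import java.util.concurrent.BlockingQueue;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;
            import java.util.concurrent.LinkedBlockingQueue;

            // Sketch only: one work queue per collection instead of a single global Overseer queue.
            public class PerCollectionOverseer {

                private final Map<String, BlockingQueue<String>> queues = new ConcurrentHashMap<>();
                // one worker per collection here; a fixed-size pool would bound the thread count
                private final ExecutorService workers = Executors.newCachedThreadPool();

                // state-change events are enqueued per collection
                public void submit(String collection, String event) {
                    BlockingQueue<String> q = queues.computeIfAbsent(collection, c -> {
                        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
                        workers.submit(() -> drain(c, queue));
                        return queue;
                    });
                    q.add(event);
                }

                private void drain(String collection, BlockingQueue<String> queue) {
                    try {
                        while (true) {
                            String event = queue.take();
                            // apply the state change for this collection only, e.g. update
                            // /collections/<name>/state.json rather than the global clusterstate.json
                            System.out.println("processing " + collection + ": " + event);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }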

        Noble Paul added a comment - - edited

        It would like to know the state of the other shards as it often has to query against all the shards in a collection

        You are missing the point that it's very unlikely for anyone to query across all shards in a VERY LARGE cluster. It is going to be almost useless and will bring the whole cluster down to a crawl. In a VERY LARGE cluster a node needs to know about other shards only when it gets a request/update for another shard. But even that may be a rare case if you use SolrJ as your client, which will route requests at the client level. We will have to attack this problem sooner or later if we actually take SolrCloud to a very large scale. I'm not saying this has to be the first step. We will pluck the low-hanging fruits first, of course.

        Noble Paul added a comment -

        Mark Miller, as the man who conceived SolrCloud, I'm sure you have more insight into the problems with the different approaches.

        Do we have a choice of not scaling to a VERY LARGE cluster? I think it will be suicidal.

        What we need to do is identify and list the low-hanging fruits and solve them one by one.

        Mark Miller added a comment -

        Do we have a choice of not scaling to a VERY LARGE cluster ? I think it will be suicidal.

        My point is simple - I have said, yes, for a cluster over 10k nodes, some extra hoops are necessary.

        We are not currently stable at a cluster 1/10 that size or less. So jumping through 10k-cluster hoops when we can't properly scale to way less than that just seems like introducing more complexity and opportunity for new bugs before we are even stable on a much smaller scale - a scale that works with something close to the current architecture very nicely - and one that we have been slowly hardening.

        Mark Miller added a comment -

        P.S. Long term, of course I'd like to scale to infinity. And I'm not saying please don't work on this issue - I'm voicing my concerns early so that they are not a surprise later. I think there is a lot we can do here - but I worry about doing too much for 10k nodes - I worry about the amount of refactoring needed and how we are still not as stable as we need to be at a much smaller scale.

        Yago Riveiro added a comment -

        The SolrCloud environment is young and has some bugs but is relatively stable; an epic refactoring could be worse than the current scenario. Stability must be the goal.

        Noble Paul added a comment -

        Mark Miller, whatever we are building will not be committed to trunk anytime soon. Being a big-ticket item, it will probably have to live in another branch for a while for others to see and evaluate. But someone has to start it at some point.

        If you had to identify the top 5 low-hanging fruits, what would they be?

        Jack Krupansky added a comment -

        You are missing the point that it's very unlikely for anyone to query across all shards in a VERY LARGE cluster.

        This gets back to my appeal for clarity on use cases. I mean, by definition, isn't the most common query case going to query across all shards of a collection? Sure, I suppose you could have an application with custom sharding such that the app always knows what shard will have the desired query results, such as a multitenant app which shards based on the userid field, but... isn't that a special case rather than a common case?

        Now, maybe you meant simply to say that collections would tend to be smaller, but... you didn't explicitly say that.

        So, once again, let's have some clarity about how many collections, how many shards per collection, and how many replicas per shard would need to be handled for various use cases of a proposed "very large" cluster.

        Noble Paul added a comment -

        isn't the most common query case going to query across all shards of a collection?

        If you have 10,000s of shards, any distributed search across all the shards will be too slow/expensive. The most common use case at that scale would be a search that spans a single shard or a handful of shards. (It is not custom sharding; it is probably going to use the compositeId router.) If you are building a personalized website serving millions of users, this would be the common use case, e.g. a mail service, a file storage service, geographically localized search, etc.
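
        As a hypothetical illustration of that pattern: with the compositeId router a document id such as

            user42!doc123

        is hashed on the "user42" prefix, so all of that user's documents land on the same shard, and a query for that user can be restricted to that shard (e.g. via the _route_ request parameter) instead of fanning out across all shards.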

        Now, maybe you meant simply to say that collections would tend to be smaller

        I don't wish to limit scaling to a large number of small collections or vice versa. That should be the choice of the user. But we can prioritize one kind of scaling over the other. 10,000s of collections or 100,000s of shards would be the ultimate aim. We won't get there in one step; it has to be iterative.

        So, once again, let's have some clarity about how many collections

        The point is, we didn't build SolrCloud with a specific number in mind. The objective was to scale as much as possible. The next logical step would be to scale a lot higher by eliminating the known bottlenecks one by one.

        Noble Paul added a comment - - edited

        OK, here is the plan to split the clusterstate on a per-collection basis.

        How to use this feature?

        Introduce a new option while creating a collection (external=true). This will keep the state of the collection in a separate ZK node.
        Example:

        http://localhost:8983/solr/admin/collections?action=CREATE&name=xcoll&numShards=5&replicationFactor=2&external=true

        This will result in the following entry in clusterstate.json:

        {
          "xcoll" : {"ex":true}
        }

        There will be another ZK entry which carries the actual collection information:

        • /collections
          • /xcoll
            • /state.json
              {"xcoll":{
                  "shards":{"shard1":{
                      "range":"80000000-b332ffff",
                      "state":"active",
                      "replicas":{
                        "core_node1":{
                          "state":"active",
                          "base_url":"http://192.168.1.5:8983/solr",
                          "core":"xcoll_shard1_replica1",
                          "node_name":"192.168.1.5:8983_solr",
                          "leader":"true"}}}},
                  "router":{"name":"compositeId"}}}

        The main Overseer thread is responsible for creating collections and managing all the events for all the collections in clusterstate.json. clusterstate.json is modified only when a collection is created/deleted or when state updates happen to "non-external" collections.

        Each external collection will have its own Overseer queue, as follows. There will be a separate thread for each external collection.

        • /collections
          • /xcoll
            • /overseer
              • /collection-queue-work
              • /queue
              • /queue-work

        SolrJ enhancements

        SolrJ would not listen to any ZK node. When a request comes for a collection 'xcoll':

        • it would first check if such a collection exists
        • If yes, it first looks up the details in the local cache for that collection
        • If not found in the cache, it fetches the node /collections/xcoll/state.json and caches the information
        • Any query/update will be sent with an extra query param specifying the collection name, shard name, Role (Leader/Replica), and range (example: _target_=xcoll:shard1:L:80000000-b332ffff); see the sample request after this list. A node would throw an error (INVALID_NODE) if it does not serve that collection/shard/Role/range combo.
        • If SolrJ gets an INVALID_NODE error, it invalidates the cache and fetches fresh state information for that collection (and caches it again).
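
        For example, a request routed with that param might look like this (a hypothetical request reusing the sample values above):

            http://192.168.1.5:8983/solr/xcoll/select?q=*:*&_target_=xcoll:shard1:L:80000000-b332ffff

        If that node no longer serves the leader of that range, it responds with INVALID_NODE and SolrJ refreshes its cached state for xcoll and retries.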

        Changes to each Solr Node

        Each node would listen only to clusterstate.json and the states of the collections it is a member of. If a request comes for a collection it does not serve, it first checks for the _target_ param. All collections present in clusterstate.json will be deemed collections it serves.

        • If the param is present and the node does not serve that collection/shard/Role/Range combo, an INVALID_NODE error is thrown
          • If the validation succeeds, the request is served
        • If the param is not present and the node is a member of the collection, the request is served
          • If the node is not a member of the collection, it uses SolrJ to proxy the request to the appropriate location (see the sketch after this list)
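
        A minimal sketch of that node-side check (the parsing of _target_ and the error type are assumptions for illustration, not Solr code):

            import java.util.Set;

            // Sketch of the node-side validation of the _target_ param described above.
            public class TargetParamValidator {

                /** collection/shard/Role/range combos this node currently serves,
                 *  e.g. "xcoll:shard1:L:80000000-b332ffff". */
                private final Set<String> served;

                public TargetParamValidator(Set<String> served) { this.served = served; }

                /**
                 * @param target             the _target_ request param, or null if absent
                 * @param memberOfCollection whether this node hosts a replica of the requested collection
                 * @return true to serve the request locally, false to proxy it via SolrJ
                 */
                public boolean shouldServeLocally(String target, boolean memberOfCollection) throws InvalidNodeException {
                    if (target != null) {
                        if (!served.contains(target)) {
                            // param present but this node does not serve that combo -> INVALID_NODE
                            throw new InvalidNodeException("INVALID_NODE: " + target);
                        }
                        return true;              // validation succeeded, serve it
                    }
                    return memberOfCollection;    // no param: serve if member, otherwise proxy
                }

                public static class InvalidNodeException extends Exception {
                    public InvalidNodeException(String msg) { super(msg); }
                }
            }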

        Internally, the node really does not care about the state of external collections. If/when it is required, the information is fetched in real time from ZK, used, and thrown away.

        Changes to admin GUI

        External collections are not shown graphically in the admin UI.

        Yago Riveiro added a comment -

        There will be a separate thread for each external collection

        If we have 100K collections, does that mean we need 100K threads?

        They are spread around all the machines of the cluster, but it's still too much.

        I may be wrong, but if we have 100K collections and only 10% are active at a time, we still need to allocate resources for 100K threads.

        Is it not possible to have a pool with X threads (X configurable) that handles the external collections?

        Noble Paul added a comment -

        Is it not possible have a pool with X threads (X can be configurable) that treats external collections?

        Makes sense.

        Mark Miller added a comment -

        Noble and I have discussed his proposal above at Revolution and come to some agreement.


          People

          • Assignee: Noble Paul
          • Reporter: Noble Paul
          • Votes: 0
          • Watchers: 15


              Time Tracking

              • Original Estimate: 2,016h
              • Remaining Estimate: 2,016h
              • Time Spent: Not Specified
