Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, Trunk
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      In this mode SolrJ would not watch any ZK node.

      It fetches the state on demand and caches the most recently used n collections in memory.

      When a request comes in for a collection 'xcoll':

      • It first checks whether such a collection exists.
      • If yes, it looks up the details for that collection in the local cache.
      • If the details are not found in the cache, it fetches the node /collections/xcoll/state.json and caches the information.
      • Any query/update is sent with an extra query param specifying the collection name and version (example: _stateVer=xcoll:34). A node throws an error (INVALID_NODE) if it does not have the right version.
      • If SolrJ gets an INVALID_NODE error, it invalidates the cache entry, fetches fresh state information for that collection, and caches it again.

      If there is a connection timeout, SolrJ assumes the node is down, re-fetches the state for the collection, and tries again.
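
      A minimal sketch of this flow (the class, helper methods, and LRU cache below are illustrative stand-ins, not the actual SolrJ implementation):

      import java.util.LinkedHashMap;
      import java.util.Map;

      /** Hypothetical sketch of the lazy, cache-and-retry flow described above; not actual SolrJ code. */
      class LazyStateClient {

        /** Minimal stand-in for the state read from /collections/<name>/state.json. */
        static class CollectionState {
          final int zkVersion;      // zk version of state.json at the time it was read
          final String stateJson;   // raw state payload
          CollectionState(int zkVersion, String stateJson) {
            this.zkVersion = zkVersion;
            this.stateJson = stateJson;
          }
        }

        /** Stand-in for the INVALID_NODE error returned when the server sees a stale version. */
        static class StaleStateException extends RuntimeException {}

        private static final int MAX_CACHED = 100;  // keep only the n most recently used collections

        // access-ordered LinkedHashMap gives a simple (non-thread-safe) LRU cache
        private final Map<String, CollectionState> cache =
            new LinkedHashMap<String, CollectionState>(16, 0.75f, true) {
              @Override protected boolean removeEldestEntry(Map.Entry<String, CollectionState> e) {
                return size() > MAX_CACHED;
              }
            };

        String request(String collection, String params) {
          CollectionState state = cache.computeIfAbsent(collection, this::fetchStateJson);
          try {
            // every request carries the cached version, e.g. _stateVer=xcoll:34
            return send(collection, params, collection + ":" + state.zkVersion);
          } catch (StaleStateException stale) {
            // INVALID_NODE: drop the stale entry, re-read state.json once, and retry
            cache.remove(collection);
            CollectionState fresh = cache.computeIfAbsent(collection, this::fetchStateJson);
            return send(collection, params, collection + ":" + fresh.zkVersion);
          }
        }

        /** One-shot read of /collections/<name>/state.json; no ZK watcher is registered. */
        private CollectionState fetchStateJson(String collection) {
          return new CollectionState(34, "{}");   // placeholder for a real ZK read
        }

        /** Sends the request with the extra _stateVer query parameter attached. */
        private String send(String collection, String params, String stateVer) {
          return "ok";                            // placeholder for a real HTTP request
        }
      }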

      1. fail.logs
        492 kB
        Mark Miller
      2. SOLR-5474.patch
        41 kB
        Timothy Potter
      3. SOLR-5474.patch
        39 kB
        Timothy Potter
      4. SOLR-5474.patch
        33 kB
        Timothy Potter

        Issue Links

          Activity

          Otis Gospodnetic added a comment -

          Noble: can you share the use case - why is this valuable, what situations is this meant for, what issues does it solve? Thanks.
          Note: this is something I generally miss and would love to see people include this sort of information when they open JIRA issues. This would help everyone trying to follow Solr development better understand what is going on and why.

          Noble Paul added a comment -

          The parent issue is to enable Solr to deal with a large number of collections, and hence we are going to see an explosion in the number of nodes watching ZK. Assuming that there will be many more clients than the nodes themselves:

          • There will be too many watchers on the main clusterstate
          • If there are multiple state nodes, clients would need to watch those ZK nodes as well

          We want to minimize the load on ZK. Moreover, clients don't all need to be aware of all collections and their states; state can be fetched lazily on demand.

          This will be another class extending SolrServer

          class LazyCloudSolrServer extends SolrServer {
          }
          
          
          Timothy Potter added a comment -

          I like the strategy of using a LazyCloudSolrServer as that keeps this "lazy" mode de-coupled from the current implementation, giving Solr users the choice based on their environment. However, while planning an implementation strategy for this, I discovered that some of the core cloud classes in the org.apache.solr.common.cloud package, such as CloudSolrServer and ZkStateReader, are closed for extension because all of the member variables one would need access to are marked private. It seems reasonable that one would want to extend these classes, if for some reason other than this ticket alone. Curious whether I can plan to mark the members of the core classes protected instead of private, or should an implementation of this ticket leave all existing cloud code untouched even if that leads to copy-and-paste reuse in some cases? Immediately, I can see needing to change some access modifiers for CloudSolrServer, ZkStateReader, and ClusterState.

          Mark Miller added a comment -

          If it's for our own internal use and helps us, feel free to open those classes up. I would just perhaps note in a comment that they are not intended for user extension - because we do not want to be locked into any internal back compat for those classes at this point I think.

          Timothy Potter added a comment -

          I'm going to try to see how far I can get without changing the existing code, but good to know I can change access levels if I get stuck. Thanks.

          Noble Paul added a comment - edited

          As part of SOLR-5473 the ZkStateReader will have an option of not watching clusterstate.json. This option could be enabled with a flag. The LazyCloudSolrServer can piggyback on the new ZkStateReader.

          The _target_ param would help the client know if a node goes down. But if a new replica comes up, the client would not know about the new replica until the cache gets invalidated by a:

          • timeout
          • INVALID_NODE from the server
          • connection timeout to a replica of any node in that collection

          An alternative would be to send the zk version as a parameter, for example _stateVer_=collx:234. If there is a version mismatch, or the node does not serve 'collx', respond with an INVALID_STATE error.

          The DocCollection object would have a new method getZkVersion() which gives the current version of the zk object from which the DocCollection was created.
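
          A rough sketch of the server-side check for that alternative (the class and helper methods below are hypothetical, not actual Solr handler code):

          /**
           * Illustrative check of the proposed _stateVer_ parameter
           * (format: <collection>:<zkVersion>, e.g. _stateVer_=collx:234).
           */
          class StateVerCheck {

            /** zk version of this collection's state.json as currently seen by this node. */
            private int localZkVersion(String collection) {
              return 234;   // placeholder for something like DocCollection.getZkVersion()
            }

            /** Does this node host a replica of the collection at all? */
            private boolean servesCollection(String collection) {
              return true;  // placeholder
            }

            void verify(String stateVerParam) {
              String[] parts = stateVerParam.split(":");
              String collection = parts[0];
              int clientVersion = Integer.parseInt(parts[1]);

              if (!servesCollection(collection) || clientVersion != localZkVersion(collection)) {
                // the client's cached state is stale, or this node does not serve the collection
                throw new RuntimeException("INVALID_STATE: " + stateVerParam);
              }
            }
          }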

          Timothy Potter added a comment -

          Thanks for the info about changes to ZkStateReader for SOLR-5473. I'm trying to think about how to differentiate between downed nodes and slow queries using this approach.

          Let's consider the scenario where there are two nodes serving a shard (A & B) and LazyCloudSolrServer sends a query request to node A. Imagine that node A is down, but the client application doesn't know that yet because its cached state is stale. The request will timeout after some configurable duration. After the timeout, LazyCloudSolrServer refreshes the cached state and realizes node A is down so it sends the request to node B and the query succeeds.

          However, if node A is actually healthy and the cause of the timeout is a slow query, then the client should have waited longer. After refreshing the state from ZooKeeper (in response to the timeout), the client can realize that since A was healthy, the cause of the timeout was likely a slow query. So does the client re-send the slow query? That seems like it could end up in a loop of timeouts and resends. Does LazyCloudSolrServer keep track of how many attempts it's made for a given query ... just brainstorming here ... I know Solr supports the timeAllowed parameter for a query but that's optional.

          I suppose this scenario is still possible even with the current approach of having a watcher on the state znode on the client side. Although, I have to think that under the current approach, the probability of sending a request to a downed node goes down since state is refreshed in real time. The zk version doesn't help here because if node A is down, the only thing the client can do is wait for the request to time out.

          Noble Paul added a comment -

          Before updating the cache it should check the version of "state" for that collection. If it finds that the version has actually not changed, it should just continue talking to the other nodes in the shard and not invalidate the cache.
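
          A small sketch of that check, assuming hypothetical helpers for the cached version and a cheap version-only read from ZK (neither is an existing Solr API):

          /** Illustrative "compare versions before invalidating" check. */
          class StaleCheckBeforeInvalidate {

            boolean shouldInvalidate(String collection) {
              int cached = cachedZkVersion(collection);            // version stored with the cached state
              int current = currentZkVersionInZk(collection);      // cheap version-only read from ZK
              // If the version has not changed, the cached state is still valid:
              // keep it and simply try the other replicas in the shard.
              return current != cached;
            }

            private int cachedZkVersion(String collection) { return 3; }       // placeholder
            private int currentZkVersionInZk(String collection) { return 3; }  // placeholder
          }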

          Timothy Potter added a comment -

          As Noble mentions above, this approach needs a time-to-live (TTL) timeout for cached collection state. For this, I'm thinking of keeping an additional long property along with the cached state that holds the timestamp of when the state was last loaded from ZooKeeper. An age is computed when objects are requested from the cache and if the age is older than a configured TTL, then the object is removed from the cache and null is returned. Thinking something like the following internal private class ...

          class CachedDocCollection {
            private DocCollection cached;
            private long cachedAt;

            boolean isExpired(long ttl) {
              return (System.currentTimeMillis() - cachedAt) > ttl;
            }
          }

          This approach has the benefit of not needing any new dependencies (such as Guava's cache) and doesn't require a background thread to do cache housekeeping.

          This has the downside of computing an age on each "get", which I don't think is a big deal because state is probably not going to be requested all that frequently.
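
          For illustration, the cache lookup could use that class roughly like this (the surrounding CollectionStateCache class, field names, and the 60-second TTL are hypothetical, and DocCollection is replaced by Object to keep the sketch self-contained):

          import java.util.concurrent.ConcurrentHashMap;

          class CollectionStateCache {

            private static final long TTL_MS = 60_000;   // illustrative 60-second time-to-live

            static class CachedDocCollection {
              final Object cached;                        // stands in for DocCollection
              final long cachedAt = System.currentTimeMillis();
              CachedDocCollection(Object cached) { this.cached = cached; }
              boolean isExpired(long ttl) { return (System.currentTimeMillis() - cachedAt) > ttl; }
            }

            private final ConcurrentHashMap<String, CachedDocCollection> cache = new ConcurrentHashMap<>();

            /** Returns the cached state, or null if absent or older than the TTL (forcing a re-fetch). */
            Object getCachedState(String collection) {
              CachedDocCollection entry = cache.get(collection);
              if (entry == null) return null;
              if (entry.isExpired(TTL_MS)) {
                cache.remove(collection);                 // age computed on each get; no housekeeping thread
                return null;
              }
              return entry.cached;
            }
          }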

          Timothy Potter added a comment -

          Here's the start of a solution for this ticket. This first patch contains a LazyCloudSolrServer implementation + unit tests and changes to ZkStateReader to not watch any znodes. More functional unit tests are needed, as are some chaos-monkey-style tests to verify that the lazy approach can respond to all possible cluster state changes correctly.

          Timothy Potter added a comment -

          Previous patch file was created incorrectly with git ...

          Timothy Potter added a comment -

          Here's a new patch that takes a slightly different approach to the previous. The previous did the following:

          1 – Client application creates a request targeted to external collection "foo"

          2 - CloudSolrServer (on the client side) doesn't know about "foo", so it fetches a one-time snapshot of foo's state from ZK using lazy loading. It caches that state and keeps track of the state version, e.g. 1

          3 - CloudSolrServer sends the request to one of the nodes servicing “foo” based on the state information it retrieved from ZK. If the request is an update request, it will go to the leader, if it is a query, the request will go to one of the replicas using LBSolrServer. Every request contains the stateVer parameter, e.g. stateVer=foo:1

          4 - Server-side compares the stateVer it receives in the request from the client with its stateVer for foo and generates an INVALID_STATE error if they don't match. The server does have a watcher for foo’s state in each replica.

          There are some subtle issues with this:

          1) If a new replica is added (or recovers) in "foo", then the state of "foo" on the server changes and the request fails with an INVALID_STATE even though it probably shouldn't fail, but that's the only way now to tell the client its state is stale.

          There is retry logic in the client and the retry may work, but it might not, because there's nothing that prevents the state from changing again in between the client receiving the INVALID_STATE response, re-fetching state from ZK, and re-issuing the request. Also, failing a request when a positive state change occurs (e.g. adding a replica) just to invalidate the cache seems problematic to me. In other words, the state of a collection has changed, but in a positive way that shouldn't lead to a request failing. Of course, with the right number of retries, the request will likely work in the end, but one can envision a number of network round-trips between the client and server just to respond to potentially benign state changes.

          2) Since the client-side is not "watching" any znodes, it runs the risk of trying to send a request to a node that is no longer live. Currently, the CloudSolrServer consults /live_nodes to make sure a node is "live" before it attempts to send a request to it. Without watchers, the client side has no way of knowing a node isn't "live" until an error occurs. So now it has to wait for some time for the request to time out and then refresh /live_nodes from ZooKeeper.

          3) Aliases – what happens if a collection is added to an alias? Without watchers, the client won’t know the alias changed. I’m sure we could implement a similar stateVer solution for aliases but that seems less elegant than just watching the znode.

          4) Queries that span multiple collections … I think problems #1 and #2 mentioned above just get worse when dealing with queries that span multiple collections.

          So based on my discussions with Noble, the new patch takes the following approach:

          1) No more LazyCloudSolrServer; just adding support for external collections in CloudSolrServer

          2) Still watch shared znodes, such as /aliases and /live_nodes

          3) State for external collections loaded on demand and cached

          As it stands now, the CloudSolrServer does not watch external collections when running on the client side, the idea being that there may be too many external collections to set up watchers for. Thus, state is requested on demand and cached. This opens the door for the cached state to go stale, leading to an INVALID_STATE error.

          However, this presents the need for a new public method on ZkStateReader (currently named refreshAllCollectionsOnClusterStateChange), which refreshes the internal allCollections set containing the names of both internal collections (those in /clusterstate.json) and external ones. While this approach works, it seems like an external object telling an internal object to fix itself, which is somewhat anti-OO. One improvement would be to dynamically update allCollections when a new external collection is discovered. Please advise.

          External collections get watched on the server side only, which is set up by the ZkController. So client-side uses of CloudSolrServer will not have watchers set up for external collections.

          The remaining issue with this patch is how to handle requests that span multiple external collections, as the stateVer parameter only supports a single collection at this time. A simple comma-delimited list of collection:ver pairs could be passed and the server could check each one. However, the test case for multiple collections is not passing and is commented out currently. The next patch will address that issue.
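
          For illustration, the comma-delimited encoding could look like the following sketch (the class and method names are made up for the example and are not part of the patch):

          import java.util.LinkedHashMap;
          import java.util.Map;

          /** Illustrative encoding/decoding of a multi-collection stateVer value, e.g. "foo:3,bar:7". */
          class MultiCollectionStateVer {

            static String encode(Map<String, Integer> versions) {
              StringBuilder sb = new StringBuilder();
              for (Map.Entry<String, Integer> e : versions.entrySet()) {
                if (sb.length() > 0) sb.append(',');
                sb.append(e.getKey()).append(':').append(e.getValue());
              }
              return sb.toString();
            }

            static Map<String, Integer> decode(String stateVer) {
              Map<String, Integer> versions = new LinkedHashMap<>();
              for (String pair : stateVer.split(",")) {
                String[] parts = pair.split(":");
                versions.put(parts[0], Integer.parseInt(parts[1]));   // collection -> zk version
              }
              return versions;
            }
          }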

          Timothy Potter added a comment -

          Updated patch with the unit test passing. The one remaining issue I can see is that CloudSolrServer is having to call refreshAllCollectionsOnClusterStateChange on the ZkStateReader, which I believe Noble mentioned is due to a possible bug in how allCollections is kept up-to-date for SOLR-5473.

          ASF subversion and git services added a comment -

          Commit 1587834 from noble@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1587834 ]

          SOLR-5473: Make one state.json per collection , SOLR-5474: Have a new mode for SolrJ to support stateFormat=2

          Mark Miller added a comment - edited

          SOLR-5474: Have a new mode for SolrJ to support stateFormat=2

          I don't fully understand the CHANGES entry for this. Really, we should say CloudSolrServer, not SolrJ?

          Also, stateFormat=2 means that clusterstate.json is split out by collection right? And support for that is in ZkStateReader, which SolrJ uses.

          Does this issue give CloudSolrServer support for stateFormat=2, or does it add a new mode of operation where watchers are not used to monitor zk state?

          Noble Paul added a comment - edited

          Really, we should say CloudSolrServer, not SolrJ?

          We could. But originally the plan was different, and things changed a lot when we started implementing it.

          And support for that is in ZkStateReader....

          Yes, but ZkStateReader would fetch the collection details from ZK if it is not a 'watched collection'. For a client, watching a collection is not really an option. So the caching support is added outside of ZkStateReader, right into CloudSolrServer.

          Does this issue give CloudSolrServer support for stateFormat=2

          Actually, the user does not have to specify anything. CloudSolrServer would automatically handle stateFormat=2.

          The definition of the ticket is slightly misleading, because the original design was different.

          Mark Miller added a comment -

          I guess what I meant to say is that it's a pretty useless CHANGES entry for a user. I'm not talking about design, or implementation, or anything. I'm trying to pull out of you what was done here, so you can realize the CHANGES entry does a pretty bad job reflecting it.

          Have a new mode for SolrJ to support stateFormat=2

          SolrJ does not have a new mode. The CloudSolrServer SolrServer implementation that is part of SolrJ has a new mode.

          So the caching support is added outside of ZkStateReader, right into CloudSolrServer

          The CHANGES entry says nothing about "caching" support - it says support for stateFormat=2, which CloudSolrServer and SolrJ have without this issue! You could already say SolrJ supports stateFormat=2, as I said - it's a very general statement, so saying support for it was added here means nothing.

          stateFormat=2 support is meaningless to any user in any case. We should try and make CHANGES entries that have value to the user by trying to write something that they can understand. If a user reads this entry, they have no clue what we did or how they might use it or why it even matters at all. CHANGES is for the user and not the devs that implemented the issue.

          Noble Paul added a comment -

          Actually, you are right.

          As I said earlier, the original concept was completely different and the ticket was created with that in mind. If the design had been as it is implemented today, I would NOT have created this ticket; it would have been a part of SOLR-5473. My recommendation would be to just remove the entry from CHANGES.txt and add the credit for Timothy Potter to SOLR-5473, because he is the one who implemented this.

          Mark Miller added a comment -

          This does seem like a discrete issue that Tim worked on, though. It does seem to deserve its own CHANGES entry; it could just be a little more user-friendly.

          Can we do something along the lines of:

          SOLR-5381: Add a new mode to CloudSolrServer that scales better with a large number of collections by attempting to only update cached cluster state when necessary rather than on any change.
          (Tim Potter, Noble Paul) *
          
          • Personally, I'd lead with Tim since he drove the code on this issue.
          Noble Paul added a comment -

          Take 2

          SOLR-5474: CloudSolrServer enhancements to scale to a large number of collections, where relevant collection states are cached and updated when necessary instead of being updated in real time (Tim Potter, Noble Paul)
          
          Mark Miller added a comment -

          Attached is a test fail for this.

          org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: STATE STALE: collection1:15valid : false
          	at __randomizedtesting.SeedInfo.seed([170B85B5FF6722D6:96ED0BAD883842EA]:0)
          	at org.apache.solr.client.solrj.impl.CloudSolrServer.requestWithRetryOnStaleState(CloudSolrServer.java:683)
          	at org.apache.solr.client.solrj.impl.CloudSolrServer.requestWithRetryOnStaleState(CloudSolrServer.java:676)
          	at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:556)
          	at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
          	at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
          	at org.apache.solr.cloud.AbstractFullDistribZkTestBase.queryServer(AbstractFullDistribZkTestBase.java:1292)
          	at org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:558)
          	at org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:540)
          	at org.apache.solr.BaseDistributedSearchTestCase.query(BaseDistributedSearchTestCase.java:519)
          	at org.apache.solr.cloud.BasicDistributedZk2Test.brindDownShardIndexSomeDocsAndRecover(BasicDistributedZk2Test.java:291)
          	at org.apache.solr.cloud.BasicDistributedZk2Test.doTest(BasicDistributedZk2Test.java:109)
          	at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:865)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          
          Noble Paul added a comment -

          Actually, the current implementation retries only once if the state is stale. In reality it is possible that the state changed again after the client refreshed the state, and it gets an error again with the same message.

          Timothy Potter
          Instead of trying only twice, it should try a higher number of times, say 5.
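
          For illustration, a bounded retry loop along those lines might look like this (the constant, exception type, and helper methods are made up for the sketch and are not the actual CloudSolrServer code):

          /** Illustrative bounded retry-on-stale-state loop. */
          class StaleStateRetry {

            static class StaleStateException extends RuntimeException {}

            private static final int MAX_STALE_RETRIES = 5;   // retry more than once, as suggested above

            String request(String collection, String params) {
              for (int attempt = 0; attempt < MAX_STALE_RETRIES; attempt++) {
                try {
                  return sendWithCachedState(collection, params);
                } catch (StaleStateException stale) {
                  refreshStateFromZk(collection);              // invalidate and re-read state.json, then retry
                }
              }
              throw new RuntimeException("state still stale after " + MAX_STALE_RETRIES + " attempts");
            }

            private String sendWithCachedState(String collection, String params) { return "ok"; }  // placeholder
            private void refreshStateFromZk(String collection) {}                                   // placeholder
          }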

          ASF subversion and git services added a comment -

          Commit 1590983 from Noble Paul in branch 'dev/branches/solr5473'
          [ https://svn.apache.org/r1590983 ]

          Creating a branch to sort out SOLR-5473 , SOLR-5474

          ASF subversion and git services added a comment -

          Commit 1591253 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1591253 ]

          revert SOLR-5473 , SOLR-5474

          ASF subversion and git services added a comment -

          Commit 1603938 from Noble Paul in branch 'dev/branches/solr-5473'
          [ https://svn.apache.org/r1603938 ]

          latest patch applied to trunk for SOLR-5473 SOLR-5474

          ASF subversion and git services added a comment -

          Commit 1607418 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1607418 ]

          SOLR-5473 one state.json per collection , SOLR-5474 support for one json per collection

          ASF subversion and git services added a comment -

          Commit 1607587 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1607587 ]

          reverting SOLR-5473 , SOLR-5474

          Noble Paul added a comment -

          When SOLR-5473 got reverted, this too got reverted.

          ASF subversion and git services added a comment -

          Commit 1624556 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1624556 ]

          split clusterstate.json SOLR-5473, SOLR-5474, SOLR-5810

          ASF subversion and git services added a comment -

          Commit 1624650 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1624650 ]

          SOLR-5473, SOLR-5474, SOLR-5810 ... NO SVN KEYWORDS! ! !

          ASF subversion and git services added a comment -

          Commit 1626534 from Noble Paul in branch 'dev/trunk'
          [ https://svn.apache.org/r1626534 ]

          SOLR-5474 NPE in tests

          ASF subversion and git services added a comment -

          Commit 1626535 from Noble Paul in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1626535 ]

          SOLR-5474 NPE in tests

          ASF subversion and git services added a comment -

          Commit 1626602 from steve_rowe in branch 'dev/trunk'
          [ https://svn.apache.org/r1626602 ]

          SOLR-5474 Fix NPE in tests (some more)

          ASF subversion and git services added a comment -

          Commit 1626603 from steve_rowe in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1626603 ]

          SOLR-5474 Fix NPE in tests (some more) (merged trunk r1626602)

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.


            People

            • Assignee:
              Noble Paul
              Reporter:
              Noble Paul
            • Votes:
              1
              Watchers:
              8
