SOLR-6249: Schema API changes return success before all cores are updated

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0
    • Labels:
      None

      Description

      See SOLR-6137 for more details.

      The basic issue is that Schema API changes return success when the first core is updated, but other cores asynchronously read the updated schema from ZooKeeper.

      So a client application could make a Schema API change and then index some documents based on the new schema that may fail on other nodes.

      Possible fixes:
      1) Make the Schema API calls synchronous
      2) Give the client some ability to track the state of the schema. They can already do this to a certain extent by checking the Schema API on all the replicas and verifying that the field has been added, though this is pretty cumbersome. Maybe it makes more sense to do this sort of thing at the collection level, i.e. Schema API changes return the zk version to the client. We add an API to return the current zk version. On a replica, if the zk version is >= the version the client has, the client knows that replica has at least seen the schema change. We could also provide an API to do the distribution and checking across the different replicas of the collection so that clients don't need to do that themselves (a rough client-side sketch follows below).
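
      A rough client-side sketch of option 2 (illustration only, not from any patch on this issue): it assumes the /schema/zkversion endpoint discussed in the comments below, a JSON response containing a "zkversion" number, placeholder replica URLs, and Java 11's built-in HttpClient with naive response parsing.

          import java.net.URI;
          import java.net.http.HttpClient;
          import java.net.http.HttpRequest;
          import java.net.http.HttpResponse;
          import java.util.List;
          import java.util.regex.Matcher;
          import java.util.regex.Pattern;

          public class WaitForSchemaVersion {
            private static final Pattern ZK_VERSION = Pattern.compile("\"zkversion\"\\s*:\\s*(\\d+)");

            public static void main(String[] args) throws Exception {
              int expected = 7; // zk version reported back by the Schema API change
              List<String> replicas = List.of(
                  "http://host1:8983/solr/collection1",
                  "http://host2:8983/solr/collection1");

              HttpClient http = HttpClient.newHttpClient();
              for (String baseUrl : replicas) {
                while (true) { // poll until this replica has seen at least the expected version
                  HttpRequest req = HttpRequest.newBuilder(
                      URI.create(baseUrl + "/schema/zkversion?wt=json")).GET().build();
                  Matcher m = ZK_VERSION.matcher(
                      http.send(req, HttpResponse.BodyHandlers.ofString()).body());
                  if (m.find() && Integer.parseInt(m.group(1)) >= expected) break;
                  Thread.sleep(250);
                }
              }
              // Safe to index documents that rely on the new field(s) now.
            }
          }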

      Attachments

      1. SOLR-6249_reconnect.patch
        12 kB
        Timothy Potter
      2. SOLR-6249_reconnect.patch
        6 kB
        Timothy Potter
      3. SOLR-6249.patch
        50 kB
        Timothy Potter
      4. SOLR-6249.patch
        43 kB
        Timothy Potter
      5. SOLR-6249.patch
        41 kB
        Timothy Potter


          Activity

          Timothy Potter added a comment -

          Going to start working on this ... a few initial thoughts:

          ZkIndexSchemaReader has a ZK watcher to receive notification when the schema is updated. So we're talking about a very small window of time between the updated schema being written and all replicas seeing the update. Consequently I think it's reasonable for the core that accepted the API request to block with a reasonable timeout to give all the active replicas time to apply the update. In other words, I don't think we should put the burden on the client to assess the success / failure of the operation across all replicas. I'll need to figure out what to block on but ZooKeeper's two-phase commit recipe sounds applicable here in that the tx coordinator (the core that accepted the API request) can block until all other replicas ack that they've applied the updates successfully. Alternatively, the coordinator could just poll each active replica for the schema version it is using (along the lines of what Gregory suggested above) with a maximum amount of time the coordinator is allowed to keep polling replicas.

          I also think a replica that cannot process the update successfully should be put into the down state (so as to prevent it from receiving update/query requests). The replica should initiate this action itself if the update fails so that we don't have replicas with mixed schemas running in the cluster. I'll need to dig into the ramifications of that, but my thinking here is if 1 replica applies the update successfully and another fails, then we probably still want the request to succeed from the client perspective and just put the replica having problems into the down state. I prefer this over the approach where an update must succeed on all "active" replicas or fail entirely, which gets us into the realm of distributed transactions for these types of updates, which is now hard because the write to ZooKeeper has already occurred (requiring compensating transaction type solutions where we'd need to back out the write).

          Timothy Potter added a comment -

          Just jotting down more notes ...

          Forgot that a managed-schema can be shared across many collections (potentially hundreds) so that complicates the blocking idea. We could have the API call block until a random active replica returns the expected zkversion using a new endpoint I'm adding, /schema/zkversion (with some timeout built-in to avoid blocking too long). The idea here is that if one other replica has the update, then likely all do, which is a little loose but this is just to alleviate the need for a client application to add any polling code on the client side. In other words, we'll block for a short amount of time (just enough that another replica has the update) which should be sufficient and allow the client to proceed with using the update.

          Looked into the two-phase commit recipe in ZK and that seems like overkill for this.

          Seems to me like less coordination here is best? In other words, instead of the core that accepts the update request worrying about all the other cores that are going to see the update, it just assumes it will be successful. If any of the remote cores fail in processing the update, then they mark themselves as "down". This seems to solve the root of the problem raised by this ticket, namely a replica using the wrong version of a managed schema. It's likely that a replica that can't process a good update successfully will need some manual intervention anyway (such as missing a JAR file or something).

          Gregory Chanan added a comment -

          Forgot that a managed-schema can be shared across many collections (potentially hundreds) so that complicates the blocking idea

          Do we need to support that? It seems like an anti-pattern to me; I did it the first time I tried to use a managed schema and got confused why all my schemas were changing when I "just" changed one. I'm sure other people have had the same experience. I'm not sure how much that changes your analysis, just a thought.

          Steve Rowe added a comment -

          The idea here is that if 1 other replica has the update, then likely all do, which is a little loose but this is just to alleviate the need for a client application to add any polling code in the client side.

          Even when all cores hosting shards in a collection are on the same box, as in unit testing, and even though all cores receive schema change notifications within single-digit milliseconds of each other, the time it takes to reload the schema can vary by hundreds of milliseconds.

          Seems to me like less coordination here is best? In other words, instead of the core that accepts the update request worrying about all the other cores that are going to see the update, it just assumes it will be successful.

          But that's the current situation, right?

          If any of the remote cores fail in processing the update, then they mark themselves as "down". This seems to solve the root of the problem raised by this ticket, namely a replica using the wrong version of a managed schema.

          This sounds like a good idea, though we'll have to be careful that a schema update that fails everywhere doesn't bring a collection down.

          As I mentioned above, though, the root of the problem is about versioning differences that result from varying schema load time.

          Steve Rowe added a comment -

          the time it takes to reload the schema can vary by hundreds of milliseconds.

          One source of this: I noticed that it takes leaders longer to reload than non-leaders.

          Gregory Chanan added a comment - - edited

          Seems to me like less coordination here is best? In other words, instead of the core that accepts the update request worrying about all the other cores that are going to see the update, it just assumes it will be successful. If any of the remote cores fail in processing the update, then they mark themselves as "down". This seems to solve the root of the problem raised by this ticket, namely a replica using the wrong version of a managed schema. It's likely that a replica that can't process a good update successfully will need some manual intervention anyway (such as missing a JAR file or something).

          Sure, less coordination is best. I think there are multiple issues here, though:
          1) how do we handle Schema API modifications failing on some replicas?
          2) if I'm a client and I want to update the schema and send some updates that are dependent on the updated schema, how do I do that? Right now, it's a pain (and without a client-visible version it may get more complicated if we add more schema APIs, e.g. removing fields).

          I don't think the "wait for one replica" idea solves either of those issues.

          Timothy Potter added a comment -

          the time it takes to reload the schema can vary by hundreds of milliseconds

          Ok, makes sense ... not sure what to do about it ... polling across all replicas seems sub-optimal as well (not to mention it's expensive to even collect that information where there are many collections) ... maybe a happy medium is to check all replicas in the same collection only?

          But that's the current situation, right?

          Close, but with one main difference: if the update fails on one of the replicas (but succeeds on others), the one experiencing the failure still remains "active" and serves requests, i.e. there is no code that actively puts this core into the down state, that will need to be added.

          a schema update that fails everywhere doesn't bring a collection down.

          If it fails on the core that accepts the API request, it won't get written to ZK (thus leaving the other replicas unaffected). I hope that's how it already works but I haven't tested that yet.

          Do we need to support that? It seems like an anti-pattern to me;

          I think we do. If you don't want collections sharing configuration, why set it up that way when you create the collections? In other words, if you create two collections and re-use the same configuration, wouldn't you expect updates to propagate to both? I think it is a pretty common thing to share configuration across collections; why do you think this is an anti-pattern?

          Timothy Potter added a comment -

          how do we handle Schema API modifications failing on some replicas?

          My thinking is that replica has to be marked down (which means it no longer participates in update request processing or query execution). Seems complicated to try to back-out updates on replicas where the changes were successful, but I'm willing to explore that route too if we think it's a better path to go down.

          if i'm a client and I want to update the schema and send some updates that are dependent on the updated schema, how do I do that?

          I'm adding the /schema/zkversion endpoint you suggested, so it will be available to the client. We can also have the server-side code poll for this across the collection (see previous comment)

          Shawn Heisey added a comment -

          I only have one little SolrCloud setup, but right now every one of the collections that it has is using the same configuration set. I would expect that if I update my schema, it would apply the changes to all my collections, because I made the deliberate decision for them to share that configuration. Can our instructions for SolrCloud easily result in a situation where a user has multiple collections sharing a configuration, and is unaware of that fact?

          The only likely situation that I can imagine where a schema will work on one server but not another is if it relies on a class that's loaded from a jar in a local lib directory that doesn't exist on every server. I'm not sure how we would detect that and roll back changes made up to that point.

          Gregory Chanan added a comment -

          I only have one little SolrCloud setup, but right now every one of the collections that it has is using the same configuration set. I would expect that if I update my schema, it would apply the changes to all my collections, because I made the deliberate decision for them to share that configuration. Can our instructions for SolrCloud easily result in a situation where a user has multiple collections sharing a configuration, and is unaware of that fact?

          The confusing thing for me was that the schema API calls are on the collection (collectionName/schema), not the configuration. So, it looks like you are just modifying the collection rather than the configuration, and it's surprising that another collection was modified (i.e. it doesn't matter if I call collection1/schema or collection2/schema if they use the same configuration). I wouldn't have been surprised if the endpoint was on the configuration somehow (not sure how that would work though). If you really understand how everything works under the covers, I guess it makes sense.

          I'm not sure how much this matters either. I guess the question is what we do on failures, and whether only the collection endpoint you modify has to succeed, or all of the collections that use the configuration have to succeed, else we run the failure code.

          Timothy Potter added a comment -

          Here's a patch that gives a client the option to wait a max number of seconds by passing the updateTimeoutSecs parameter. This triggers the core processing the schema update to block and poll all replicas of the collection until they apply the change or timeout occurs. An exception is thrown if timeout occurs.
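
          For illustration, a client call exercising this parameter might look like the following sketch (not taken from the patch itself; the collection name, field definition, and 30-second timeout are just examples, and it uses Java 11's built-in HttpClient):

              import java.net.URI;
              import java.net.http.HttpClient;
              import java.net.http.HttpRequest;
              import java.net.http.HttpResponse;

              public class AddFieldBlocking {
                public static void main(String[] args) throws Exception {
                  // Standard Schema API add-field command; updateTimeoutSecs=30 asks the
                  // receiving core to wait (up to 30s) for all replicas to apply the change.
                  String body = "{ \"add-field\": { \"name\": \"price_f\", \"type\": \"float\", \"stored\": true } }";
                  HttpRequest req = HttpRequest.newBuilder(
                          URI.create("http://localhost:8983/solr/collection1/schema?updateTimeoutSecs=30"))
                      .header("Content-Type", "application/json")
                      .POST(HttpRequest.BodyPublishers.ofString(body))
                      .build();
                  HttpResponse<String> rsp =
                      HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
                  // A non-2xx status (or an error in the response body) indicates the
                  // timeout was hit or a replica failed to apply the update.
                  System.out.println(rsp.statusCode() + " " + rsp.body());
                }
              }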

          Timothy Potter added a comment -

          Updated patch - ignore the previous one

          Test coverage for this feature integrated into the TestCloudManagedSchemaConcurrent class.

          Steve Rowe added a comment -

          Review posted at https://reviews.apache.org/r/25742/

          Steve Rowe added a comment -

          One more thought about your patch, Timothy Potter: this is a nice addition to the API, but what do users do after the timeout causes the request to fail? Wait more and then what? There's no way to (later) confirm consistency. Maybe the schema API should support async mode, which would allow for clients to poll for completion.

          Timothy Potter added a comment -

          This mechanism is mainly a convenience for the client to not have to poll the zk version from all replicas themselves. If timeout occurs, then either A) one or more of the replicas couldn't process the update or B) one or more of the replicas was just being really slow. If A, then the client app can't really proceed safely without resolving the root cause. Not really sure what to do about this without going down the path of having a distributed transaction that allows us to rollback updates if any replicas fail.

          If B, the client can wait more, but then they would have to poll all the replicas themselves, which makes clients implement this same polling solution on their side as well, thus not very convenient.

          One thing we could do to help make it more convenient for dealing with B, and even possibly A, is to use the solution I proposed for SOLR-6550 to pass back the URLs of the replicas that timed out using the extended exception metadata. That at least narrows the scope for the client but is still inconvenient.

          Alternatively, async would work but at some point, doesn't the client have to give up polling? Hence we're back to effectively having a timeout. I took this ticket to mean that a client doesn't want to proceed with more updates until it knows all cores have seen the current update, so async seems to just move the problem out to the client.

          I'm happy to implement the async approach but from where I sit now, I think we should build distributed 2-phase commit transaction support into managed schema as it will be useful going forward for managed config. That way, clients can make a change and then be certain it was either applied entirely or not at all and their cluster remains in a consistent state. This of course would only be applied to schema and config changes so I'm not talking about distributed transactions for Solr in general.

          Timothy Potter added a comment -

          Took a slightly different approach when waiting to see changes applied across all replicas after discussing this issue with Noble Paul. Essentially, instead of polling each replica until their zk schema version matches the expected, I'm now sending a GET request like: /schema/zkversion?refreshIfBelowVersion=N where N=expected version after saving the updates to ZooKeeper. This tells the replica to proactively refresh the schema if its current version is less than the value of the refreshIfBelowVersion parameter. Of course in a healthy cluster, the watcher on the managed-schema znode will likely fire long before this request comes in, which is why the refreshIfBelowVersion parameter is needed as we don't want to hit ZooKeeper more than necessary. This approach also blocks while the schema is being refreshed so we're not having to send multiple network requests to poll for the version.

          Moreover, if the refresh fails, then likely that replica's ZooKeeper session has expired or something bad like that so that core will go down and need to recover.

          One thing to note here is that technically a GET request can make changes (if the refresh fires) which breaks pure REST principles. So I can easily move the refresh stuff out to a separate endpoint but it seemed more intuitive to just hang this off of the /schema/zkversion endpoint to me. In other words, a conditional PUT seemed just as ugly to me in this case.
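
          A simplified sketch of that coordinator-side check (not the patch code; the replica URLs, response parsing, error handling, and Java 11 HttpClient usage are illustrative only):

              import java.net.URI;
              import java.net.http.HttpClient;
              import java.net.http.HttpRequest;
              import java.net.http.HttpResponse;
              import java.util.List;
              import java.util.regex.Matcher;
              import java.util.regex.Pattern;

              public class SchemaRefreshCheck {
                private static final Pattern ZK_VERSION = Pattern.compile("\"zkversion\"\\s*:\\s*(\\d+)");

                // After the schema is saved to ZooKeeper at version 'expected', make one GET
                // per replica; refreshIfBelowVersion tells a lagging replica to refresh its
                // schema before answering, so no polling loop is needed on the happy path.
                static void ensureReplicasAtVersion(List<String> replicaCoreUrls, int expected) throws Exception {
                  HttpClient http = HttpClient.newHttpClient();
                  for (String coreUrl : replicaCoreUrls) {
                    String url = coreUrl + "/schema/zkversion?refreshIfBelowVersion=" + expected + "&wt=json";
                    String body = http.send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
                                            HttpResponse.BodyHandlers.ofString()).body();
                    Matcher m = ZK_VERSION.matcher(body);
                    int actual = m.find() ? Integer.parseInt(m.group(1)) : -1;
                    if (actual < expected) {
                      // Per the discussion above, a replica that cannot refresh would go
                      // "down" and recover rather than keep serving a stale schema.
                      throw new IllegalStateException(coreUrl + " is still at schema zk version " + actual);
                    }
                  }
                }
              }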

          Noble Paul added a comment -

          I would have preferred to move the ZK watch to ZkStateReader. This will fail in case the zkclient is closed and a new one is opened. ZkStateReader should have a method watchNode(String, listener) and unwatchNode(String, listener).

          I wonder if watching all nodes is scalable, because we will have to watch every single entry in the conf dir. A better idea is to just watch the /conf dir for children changed and the listener can look up the versions of each file and update them locally.
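
          A hypothetical shape for that listener-based API (sketch only; no such methods existed on ZkStateReader at the time, and all names here are made up):

              // Hypothetical listener interface; names are illustrative only.
              public interface ZkNodeListener {
                // Called when the watched znode's data changes.
                void nodeChanged(String path, byte[] newData, int zkVersion);
              }

              // Proposed additions on ZkStateReader (or a shared helper), as signatures only:
              //   void watchNode(String path, ZkNodeListener listener);   // register; watch survives reconnects
              //   void unwatchNode(String path, ZkNodeListener listener); // deregister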

          Steve Rowe added a comment -

          A better idea is to just watch the /conf dir for children changed and the listener can look up the versions of each file and update them locally

          There can be directory children in the conf/ dir, e.g. lang/, and AFAIK, the children changed watch only applies to direct children, not recursively.

          Noble Paul added a comment -

          You are right,
          but this model is not scalable at all.

          We can keep an empty node under /conf and just update it every time any file is modified. It is much better to watch only one node.
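
          A rough sketch of that marker-znode scheme with the plain ZooKeeper client (illustration only; the path and class are made up, and this exact scheme was not implemented in this issue):

              import org.apache.zookeeper.CreateMode;
              import org.apache.zookeeper.WatchedEvent;
              import org.apache.zookeeper.Watcher;
              import org.apache.zookeeper.ZooDefs;
              import org.apache.zookeeper.ZooKeeper;

              public class ConfMarkerWatch {
                static final String MARKER = "/configs/myconf/_changed_"; // made-up path

                // Writer side: bump the marker znode after any file under the conf dir changes.
                static void touchMarker(ZooKeeper zk) throws Exception {
                  if (zk.exists(MARKER, false) == null) {
                    zk.create(MARKER, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                  }
                  zk.setData(MARKER, new byte[0], -1); // -1 = any version; bumps the znode version
                }

                // Reader side: a single data watch; on each change, reload whatever config
                // versions you track and re-register the watch (ZK watches fire only once).
                static void watchMarker(ZooKeeper zk) throws Exception {
                  Watcher w = new Watcher() {
                    @Override
                    public void process(WatchedEvent event) {
                      try {
                        // ... look up / reload config file versions here ...
                        zk.getData(MARKER, this, null); // re-set the watch
                      } catch (Exception e) {
                        // connection problems handled elsewhere
                      }
                    }
                  };
                  zk.getData(MARKER, w, null);
                }
              }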

          Steve Rowe added a comment -

          We can keep an empty node under /conf and just update it every time any file is modified. It is much better to watch only one node.

          That won't work when people upload changes via external tools.

          Noble Paul added a comment -

          I don't think we should support users writing directly to zk. It will be a performance problem in the future

          Steve Rowe added a comment -

          Sure, makes sense: aim for a system that doesn't require direct zk access. We're definitely not there yet, so we can't depend on this. Specifically for this issue, we can't design a system that depends on users not writing directly to zk. Right now, writing directly to zk is the only way to modify files not covered by our APIs.

          Noble Paul added a comment -

          Steve Rowe, I agree with you. It is desirable to support direct manipulation of ZK entries.
          But we can still manage with a single children watch covering all files directly under /conf. When we are implementing the other features we can do something else. I don't foresee any problem if we go with this approach.

          Timothy Potter added a comment -

          Refining our watcher strategy for conf znodes is definitely important but seems like it should be tackled in a separate JIRA ticket? Any further thoughts on the issues this ticket is trying to address specifically? If not, I'll commit to trunk.

          Noble Paul added a comment -

          Yeah, you are right.
          Let's get this out of the way. I shall open another one.

          ASF subversion and git services added a comment -

          Commit 1628181 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1628181 ]

          SOLR-6249: Schema API changes return success before all cores are updated

          Noble Paul added a comment -

          Does it take care of zkclient losing connections?

          Timothy Potter added a comment -

          It uses ZkIndexSchemaReader, which under the covers calls:

          byte[] data = zkClient.getData(managedSchemaPath, watcher, stat, true);

          The last param says to retry on connection loss. Do you think we need to do more than this?

          Noble Paul added a comment - - edited

          I'm afraid it is not sufficient. It just retries if the connection is lost in the getData() call. It does not take care of the case where the watcher is lost because of a zkclient failure.

          The correct example is the constructor of ZkStateReader

             zkClient = new SolrZkClient(zkServerAddress, zkClientTimeout, zkClientConnectTimeout,
                 // on reconnect, reload cloud info
                 new OnReconnect() {
                   // implement stuff here
                 });

          This ensures that the watchers are re-established after a reconnect.

          This is why I suggested that ZkStateReader is the right place to set watchers.

          Timothy Potter added a comment -

          Ok, that makes sense, but it seems like a generic issue rather than something specific to this class. We could solve this specifically right now by doing what you suggest, but rather than polluting ZkStateReader with utility methods for setting watchers, I think we should have a generic watcher support class. ZkStateReader's job is to handle collection state, right? It's not some generic ZK utility class that the rest of the code should use. There should be a ticket for refactoring any code that sets a watcher to do it correctly, including ZkIndexSchemaReader, or is this the only place in the code that doesn't set a watcher correctly?

          Noble Paul added a comment -

          I was just giving out an easy solution. This logic can be pushed to SolrZkClient and ZkStateReader, and everyone else can use it from there.

          Timothy Potter added a comment -

          Sure, but I'm tired of all this tribal knowledge around how to do something right in the code. We need a clear path for future coders to follow, and it seems like a watcher support class is the right solution. I've opened SOLR-6571; please make comments there.

          Mark Miller added a comment -

          This is not really tribal knowledge - it's ZooKeeper 101. Watchers do not persist across session timeouts and you need to re-create any watchers on reconnecting after a session timeout.

          Timothy Potter added a comment -

          Ok fair enough .... clearly the implementer of ZkIndexSchemaReader (not me) missed that day.

          Steve Rowe added a comment -

          The fault is all mine.

          Mark Miller added a comment -

          The last param says to retry on connection loss. Do you think we need to do more than this?

          Again, ZooKeeper 101.

          You can lose the ZK connection generally in two ways:

          • ConnectionLoss
          • SessionExpiration

          When you lose the connection due to ConnectionLoss, all the Watches remain. You have a couple choices - you can retry and hope the connection is reestablished, or do something else (see the leader election algorithm for an example of needing to do something else). Usually, you want to retry until you get a session expiration. That is what passing true as the final param does for you - it handles ConnectionLoss in the way you want to handle it 99% of the time.

          SessionExpiration means the connection was lost for too long. Watches do not span sessions. In this case, when you reconnect to ZooKeeper, it's pretty use case specific how you need to handle things, but at a minimum, it usually means re-creating any watches.
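
          For illustration, a bare-bones ZooKeeper Watcher that makes the distinction above explicit (a sketch using the stock org.apache.zookeeper API; how each state is handled is left as comments):

              import org.apache.zookeeper.WatchedEvent;
              import org.apache.zookeeper.Watcher;
              import org.apache.zookeeper.Watcher.Event.EventType;

              public class ReconnectAwareWatcher implements Watcher {
                @Override
                public void process(WatchedEvent event) {
                  if (event.getType() == EventType.None) {
                    // Connection-state events rather than znode events.
                    switch (event.getState()) {
                      case Disconnected:
                        // ConnectionLoss: existing watches survive; usually just wait/retry.
                        break;
                      case SyncConnected:
                        // Reconnected within the same session: nothing to re-register.
                        break;
                      case Expired:
                        // Session expired: all watches are gone. Open a new ZooKeeper
                        // handle and re-create every watch (what an OnReconnect hook does).
                        break;
                      default:
                        break;
                    }
                  } else {
                    // A real node event (NodeDataChanged, NodeChildrenChanged, ...):
                    // handle it, then re-set the watch since ZK watches fire only once.
                  }
                }
              }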

          Timothy Potter added a comment -

          I'll commit another fix to address the watcher re-connect issue for ZkIndexSchemaReader.

          Timothy Potter added a comment -

          Here's an additional patch that addresses the ZkIndexSchemaReader watcher reconnect issue discussed in the comments.

          Noble Paul please take a quick look. I know you recommended integrating this into ZkStateReader, but I did it in ZkController instead because on the server-side, ZkStateReader actually uses the OnReconnect implementation in ZkController. The basic idea here is that any component that sets up a watcher can register an OnReconnect listener with ZkController to be called after a reconnected Zk session; see the command method in ZkIndexSchemaReader for an example.

          If this looks acceptable to everyone watching, I'll commit and then backport to branch_5x
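
          A bare-bones sketch of the registration pattern described here (illustrative only; the listener-registration method name and the shape of the component are inferred from this description and should be checked against the actual patch):

              import org.apache.solr.cloud.ZkController;
              import org.apache.solr.common.cloud.OnReconnect;

              class ManagedSchemaWatchOwner {
                ManagedSchemaWatchOwner(ZkController zkController) {
                  createSchemaWatch();
                  // Re-create the watch after a new ZK session is established; the method
                  // name below reflects the approach described in this comment and may not
                  // match the committed code exactly.
                  zkController.addOnReconnectListener(new OnReconnect() {
                    @Override
                    public void command() {
                      // The old session (and its watches) is gone; set the watch up again.
                      createSchemaWatch();
                    }
                  });
                }

                private void createSchemaWatch() {
                  // e.g. zkClient.getData(managedSchemaPath, watcher, stat, true);
                }
              }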

          Timothy Potter added a comment -

          Here's an updated patch for the reconnect issue with some unit testing added to verify the watcher fires correctly after a zk session expiration. I think this is good to go.

          Noble Paul added a comment -

          Do we need to make a copy here in ZkController? The only synchronization is on reconnectListeners, and it's OK if new listener additions have to wait.

             OnReconnect[] toNotify = null;
             synchronized (reconnectListeners) {
               toNotify = reconnectListeners.toArray(new OnReconnect[0]);
             }

          The rest seems to be fine.

          ASF subversion and git services added a comment -

          Commit 1629229 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1629229 ]

          SOLR-6249: support re-establishing a new watcher on the managed schema znode after zk session expiration.

          ASF subversion and git services added a comment -

          Commit 1629246 from Timothy Potter in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1629246 ]

          SOLR-6249: support re-establishing a new watcher on the managed schema znode after zk session expiration.

          Anshum Gupta added a comment -

          Bulk close after 5.0 release.


            People

            • Assignee:
              Timothy Potter
              Reporter:
              Gregory Chanan
            • Votes:
               0
              Watchers:
               8
