SOLR-8416

The collections create API should return after all replicas are active.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5.1, 6.0
    • Component/s: SolrCloud
    • Labels: None

      Description

      Currently the collection creation API returns once all cores are created. In a large cluster, the cores may not be alive for some period of time after they are created. For anything requested during that period, Solr appears unstable and can return failures. Therefore it is better for the collection creation API to wait for all cores to become alive and return only after that.

      Attachments

      1. SOLR-8416.patch
        2 kB
        Michael Sun
      2. SOLR-8416.patch
        8 kB
        Michael Sun
      3. SOLR-8416.patch
        8 kB
        Michael Sun
      4. SOLR-8416.patch
        18 kB
        Michael Sun
      5. SOLR-8416.patch
        25 kB
        Mark Miller
      6. SOLR-8416.patch
        23 kB
        Mark Miller

        Issue Links

          Activity

          michael.sun Michael Sun added a comment -

          A patch is uploaded. Here are some thoughts:

          1. The patch polls the shard status from ZK and returns once all shards are active within a preset time, or throws an exception otherwise (a rough sketch follows below). An alternative is to wait for ZK notifications, but I am not sure how much that would gain.
          2. The total wait time should be configurable to fit large clusters. Should it be a Solr config or a collection config? A collection config is more natural, but a Solr config that can be set in CM may be easier for users.
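
          A rough sketch of the polling approach in item 1, for illustration only (the class name, method name, retry count, and one-second poll interval are assumptions, not the actual patch):

              import org.apache.solr.common.SolrException;
              import org.apache.solr.common.SolrException.ErrorCode;
              import org.apache.solr.common.cloud.ClusterState;
              import org.apache.solr.common.cloud.DocCollection;
              import org.apache.solr.common.cloud.Replica;
              import org.apache.solr.common.cloud.Slice;
              import org.apache.solr.common.cloud.ZkStateReader;

              class CreateCollectionWaitSketch {
                // Poll the cluster state until every replica of the new collection reports
                // ACTIVE, or give up after maxRetries one-second rounds and raise an error.
                static void waitForActiveReplicas(ZkStateReader zkStateReader, String collection, int maxRetries)
                    throws InterruptedException {
                  for (int retry = 0; retry < maxRetries; retry++) {
                    ClusterState clusterState = zkStateReader.getClusterState();
                    DocCollection coll = clusterState.getCollection(collection);
                    boolean allActive = true;
                    for (Slice slice : coll.getSlices()) {
                      for (Replica replica : slice.getReplicas()) {
                        if (replica.getState() != Replica.State.ACTIVE) {
                          allActive = false;
                        }
                      }
                    }
                    if (allActive) {
                      return; // every replica is active; the create call can return
                    }
                    Thread.sleep(1000); // simple polling; ZK notifications are the alternative mentioned above
                  }
                  throw new SolrException(ErrorCode.SERVER_ERROR,
                      "Not all replicas of " + collection + " became active within the preset time");
                }
              }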

          markrmiller@gmail.com Mark Miller added a comment -

          Thanks for the patch, a couple of comments:

          • Usually it's better to use SolrException over RuntimeException.
          • Where you catch the InterruptedException, you should restore the interrupted status (see the sketch after this list).
          • The failure exception should probably give detail on which replicas were not found to be live and ACTIVE.
          • Should not use a hard-coded 'active' but the relevant constants.
          • Should probably check if the replica's node is listed under live nodes as well as if it's active?
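
          A quick sketch of the interrupt-handling point above (collectionName is a hypothetical variable used only for illustration; this is not the patch itself):

              try {
                Thread.sleep(1000);
              } catch (InterruptedException e) {
                // restore the interrupt status before surfacing the failure
                Thread.currentThread().interrupt();
                throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
                    "Interrupted while waiting for replicas of " + collectionName + " to become active", e);
              }
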
          markrmiller@gmail.com Mark Miller added a comment -

          1. The patch polls the shard status from ZK and returns once all shards are active within a preset time, or throws an exception otherwise. An alternative is to wait for ZK notifications, but I am not sure how much that would gain.

          I agree, the complication is probably not worth it.

          Should it be a Solr config or a collection config?

          I would make it a Solr config - we need that ease of use at least. Later, if someone wants to override it per collection or something, we can look at adding that.

          michael.sun Michael Sun added a comment -

          Thanks Mark Miller for reviewing.

          gchanan Gregory Chanan added a comment -

          The patch claims to be looking at all shards being active but is actually looking at all replicas being active, right? It's also inconsistent with the other creation commands now, e.g. CREATESHARD/ADDREPLICA will return as soon as the replicas are created while this will wait until all the replicas are active.

          From a client perspective, what do you actually want? I don't think it's that all replicas are active at one time; given a large enough cluster that can be unlikely. Some possibilities:
          1) all the replicas were able to become active (you'd have to track this separately)
          2) the collection is "usable" from the client – each shard has a leader that is live and active?
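
          A minimal sketch of option 2), using the standard Slice/Replica accessors (imports as in the earlier sketch; the method name is illustrative):

              // "Usable": every shard has a leader that is both live and ACTIVE.
              static boolean collectionUsable(ClusterState clusterState, DocCollection coll) {
                for (Slice slice : coll.getSlices()) {
                  Replica leader = slice.getLeader();
                  if (leader == null
                      || leader.getState() != Replica.State.ACTIVE
                      || !clusterState.liveNodesContain(leader.getNodeName())) {
                    return false;
                  }
                }
                return true;
              }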

          michael.sun Michael Sun added a comment -

          the collection is "usable" from the client – each shard has a leader that is live and active?

          From the client's perspective, the collection or cluster is usable if the leaders are active. However, if a high indexing load starts before all replicas become active, I am not sure whether there will be too much sync traffic once the replicas do come up.

          markrmiller@gmail.com Mark Miller added a comment -

          Yeah, the load is a good point. It's a lot harder to go active once a load is started. And you would hope that right after creating the collection, everything would actually be healthy, but I see Greg's point too. I've always wrestled a bit with these ideas when considering this feature. It's not easy to make everything happy.

          gchanan Gregory Chanan added a comment -

          I'd see what HBase or other systems do.

          michael.sun Michael Sun added a comment -

          Yeah. It's not easy to make this tradeoff, in particular when optimizing for a large cluster under high load like this. In addition to looking at other systems, another choice (which probably always works) is to add an option that lets the user decide at run time whether Solr should wait for all shard leaders to be active or for all replicas.

          The additional option needs to be documented and adds a bit to the learning curve for users. But in general it won't create much burden, since users are usually prepared to do some extra tuning for a large cluster under high load.

          michael.sun Michael Sun added a comment -

          Just uploaded an updated patch for discussion. Here are the changes:

          1. Add a property to set the max wait time.
          2. Add a property to decide whether it waits for all shard leaders to be active or for all replicas.
          3. Fix the issues from Mark Miller's review, except for the following one:

          Should probably check if the replica's node is listed under live nodes as well as if it's active?

          Mark Miller Can you give me more details about it? Thanks.

          markrmiller@gmail.com Mark Miller added a comment -

          Should probably check if the replica's node is listed under live nodes as well as if it's active?

          Can you give me more details about it?

          For technical reasons, the actual state of a replica is a combination of whether its ephemeral live node exists in ZooKeeper and the state listed in the cluster state. We make a best effort on shutdown to publish DOWN for all the states, but it is only best effort; crashes and other fairly common events can leave any state in the cluster state. You can really only count on the state being accurate if you also check that the node is live. The ClusterState object has a helper method for this if I remember right.
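
          A one-line sketch of that combined check (assuming the ClusterState.liveNodesContain helper Mark mentions; variable names are illustrative):

              // The published state only counts if the replica's ephemeral live node exists.
              boolean effectivelyActive = clusterState.liveNodesContain(replica.getNodeName())
                  && replica.getState() == Replica.State.ACTIVE;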

          michael.sun Michael Sun added a comment -

          Here is an updated patch which includes checking for live nodes. Thanks Mark Miller for the suggestion.

          markrmiller@gmail.com Mark Miller added a comment -

          Thanks Michael,

          • Looks like a bunch of imports were moved above the license header?
          • We probably want to use real solr.xml config for this, or make it params for the collection create call with reasonable defaults. We generally only use system properties for internal fail-safe options we don't expect to really be used. I'd be fine with reasonable defaults that could be overridden per collection create call, but we could also allow the defaults to be configurable via solr.xml.
            +    Integer numRetries = Integer.getInteger("createCollectionWaitTimeTillActive", 10);
            +    Boolean checkLeaderOnly = Boolean.getBoolean("createCollectionCheckLeaderActive");
            
          • We should handle the checked exceptions this might throw like we do in other spots rather than use a catch-all Exception. There should be plenty of code to reference where we handle keeper and interrupted exception and do the right thing for each.
            +      try {
            +        zkStateReader.updateClusterState();
            +        clusterState = zkStateReader.getClusterState();
            +      }  catch (Exception e) {
            +        throw new SolrException(ErrorCode.SERVER_ERROR, "Can't connect to zk server", e);
            +      }
            
          • I'd probably combine the following into one IF statement:
            +          if (!clusterState.liveNodesContain(replica.getNodeName())) {
            +            replicaNotAlive = replica.getCoreUrl();
            +            nodeNotLive = replica.getNodeName();
            +            break;
            +          }
            +          if (!state.equals(Replica.State.ACTIVE.toString())) {
            +            replicaNotAlive = replica.getCoreUrl();
            +            replicaState = state;
            +            break;
            +          }
            
          • Should probably restore interrupt status and throw a SolrException.
            +      try {
            +        Thread.sleep(1000);
            +      } catch (InterruptedException e) {
            +        Thread.currentThread().interrupt();
            +      }
            
          • I'm not sure the return message is quite right. If a node's state is not ACTIVE, it does not mean it's not live. It can be DOWN and live, or RECOVERING and live, etc. A replica is either live or not, and its published state only counts if it is live.
          • Needs some tests.
          michael.sun Michael Sun added a comment -

          Thanks Mark Miller for the review. I made a patch with all fixes except for tests. Can you give me some suggestions about how to design a test for it? I can add a verification in CollectionsAPIDistributedZkTest.createCollection() to replace waitForRecoveriesToFinish(). But to test the patch completely, it seems to require injecting delays or failures into core creation, which I haven't found a good way to do.

          markrmiller@gmail.com Mark Miller added a comment -

          it seems to require injecting delays or failures into core creation, which I haven't found a good way to do

          This has been lacking in our test framework for a long time for a variety of reasons. I recently started working on something to inject faults though - we have to have it.

          Take a look at https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/util/TestInjection.java

          We should use it carefully, and in the right places, but I think testing collection creation fail cases fits the bill.

          We use assertions to call the fault injection methods so that they cannot be called by mistake in production (Solr tests require assertions to be on, and we don't properly support running with assertions on in production).
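
          A hedged sketch of that assert-based pattern (the class, field, and method names below are illustrative, not the actual TestInjection API):

              public class CoreCreationInjectionSketch {
                // Set by tests only; null means no fault is injected.
                public static volatile Integer delayInCoreCreationSeconds = null;

                // Always returns true so it can be invoked inside an assert; with
                // assertions disabled (i.e. in production), the assert is skipped
                // and the injected delay never runs.
                public static boolean injectDelayInCoreCreation() {
                  Integer delay = delayInCoreCreationSeconds;
                  if (delay != null) {
                    try {
                      Thread.sleep(delay * 1000L);
                    } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                    }
                  }
                  return true;
                }
              }

              // In the core-creation code path:
              // assert CoreCreationInjectionSketch.injectDelayInCoreCreation();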

          michael.sun Michael Sun added a comment -

          It's great that you have been working on injection. I can use it as an example. Thanks Mark Miller for the suggestion.

          michael.sun Michael Sun added a comment - - edited

          Here is the patch, with a delay injected into core creation to simulate the slow core creation or recovery situations seen in real-world scenarios. Basically it extends TestInjection.java to inject the delay.

          The patch also cleans up the code for configurations, exceptions, and error messages.
          cc Mark Miller

          markrmiller@gmail.com Mark Miller added a comment -

          I have to spend a bit of time playing with this manually still, but the latest patch looks great!

          markrmiller@gmail.com Mark Miller added a comment -

          I've started making a few changes - the tests were not yet passing consistently, and I had some thoughts about things we might improve.

          I'd still like to look at not waiting for replicas we know failed to be created, but it's a little tricky.

          Here is my current progress attached.

          markrmiller@gmail.com Mark Miller added a comment - - edited

          Another patch; I think this is almost ready. I punted for now on making it not wait for specific failures (there are some annoying complications around error reporting atm), but I made some other improvements.

          michael.sun Michael Sun added a comment -

          Ah yes, it makes sense to skip waiting for replicas to be alive for async calls or in case there is a failure.

          One question: it uses both rsp.getException() and response.getResponse().get("exception"). Are they pointing to the same exception? Thanks.

          Also, there seems to be a typo: the return is not included.

              if (response.getResponse().get("failure") != null) {
                // TODO: we should not wait for Replicas we know failed
              }
          
          markrmiller@gmail.com Mark Miller added a comment -

          Are they pointing to the same exception?

          rsp.getException happens when the response is not a 200 OK - but the collections API works a little differently in that sometimes it will put failures and exceptions in a call that returns 200 OK. We just check both cases.

          Also, there seems to be a typo: the return is not included.

          Not waiting for individual replicas that failed to create is left as a TODO there; we don't want to do anything about it yet.

          I also moved the waiting code to the CollectionsHandler. I think it's more efficient and 'safer' to pull the waiting out of the Overseer processing.

          I think we can commit this as a good start and use further JIRAs to make any improvements.
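
          A minimal sketch of the dual check described above (rsp and response are the variables referenced earlier in this thread; the rest is illustrative, not the committed code):

              // The collections API can report problems two ways: a non-200 response
              // (rsp.getException()) or failure/exception entries inside a 200 response.
              boolean createFailed = rsp.getException() != null
                  || response.getResponse().get("exception") != null
                  || response.getResponse().get("failure") != null;
              if (createFailed) {
                return; // skip waiting for replicas to become active
              }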

          michael.sun Michael Sun added a comment -

          I see. Thanks Mark Miller.

          markrmiller@gmail.com Mark Miller added a comment -

          For some reason the commit doesn't seem to have been tagged in JIRA. This is committed though. SHA:31437c9b43cf93128e284e278470a39b2012a6cb

          markrmiller@gmail.com Mark Miller added a comment -

          Thanks Michael!

          erickerickson Erick Erickson added a comment -

          Mark Miller The fix version is just "master". I found the entry in CHANGES.txt in 6x and trunk, should 6x be added to the fix versions?

          anshumg Anshum Gupta added a comment -

          Opening to back port for 5.5.1.

          jira-bot ASF subversion and git services added a comment -

          Commit a494ac95fc2c22004d89952f7262ceac8368b6c9 in lucene-solr's branch refs/heads/branch_5x from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a494ac9 ]

          SOLR-8416: The collections create API should return after all replicas are active.

          jira-bot ASF subversion and git services added a comment -

          Commit 4c88ea5532c2e23d7b29ba88ce41e19f7c58e691 in lucene-solr's branch refs/heads/branch_5_5 from markrmiller
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4c88ea5 ]

          SOLR-8416: The collections create API should return after all replicas are active.

          markrmiller@gmail.com Mark Miller added a comment -

          The fix version is just "master". I found the entry in CHANGES.txt in 6x and trunk, should 6x be added to the fix versions?

          master was supposed to be renamed to 6.

          anshumg Anshum Gupta added a comment -

          I think you accidentally removed the 5.5.1 fix version. I had back-ported this to be released with 5.5.1. I'll add that fix version back.

          stephlag Stephan Lagraulet added a comment -

          Can you close the issue, as it is committed on the 5.5 branch?


            People

            • Assignee:
              markrmiller@gmail.com Mark Miller
            • Reporter:
              michael.sun Michael Sun
            • Votes:
              1
            • Watchers:
              9
