SOLR-4044: CloudSolrServer early connect problems

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 5.1, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      If you call CloudSolrServer.connect() after ZooKeeper is up, but before clusterstate, etc. is populated, every request fails with a "No live SolrServers" exception (line 322 in LBHttpSolrServer), even though all the Solr nodes are coming up just fine:

      throw new SolrServerException("No live SolrServers available to handle this request");

      1. SOLR-4044.patch
        4 kB
        Vitaliy Zhovtyuk
      2. SOLR-4044-waitforcluster.patch
        17 kB
        Alan Woodward
      3. SOLR-4044-waitforcluster.patch
        18 kB
        Alan Woodward
      4. SOLR-4044-waitforcluster.patch
        10 kB
        Alan Woodward

          Activity

          Grant Ingersoll added a comment (edited) -

          I think this also happens even if connect is called via the process() method. AFAICT, from reading the code, it is due to ZkStateReader being instantiated once in the connect string, but not really being notified if/when cores/collections are available b/c it doesn't have anything it can set watches on, yet, b/c Solr may not be ready yet.

          I will try to write up a test to see if I can reproduce this.

          Yonik Seeley added a comment -

          Hmmm, so if I'm understanding you correctly, if you start up the CloudSolrServer early, then even after the cluster is ready, the CloudSolrServer never sees it as ready.

          If the issue truly is in ZkStateReader, this may be an issue beyond CloudSolrServer and cause problems in the servers themselves? Or perhaps the code that creates all the collection template stuff if none exists has been preventing this from being seen?

          If it looks like there is no clusterstate.json it seems like we should have an option to bail out immediately (with an appropriate error message).

          Do we also need a mode that waits and retries for a certain amount of time?

          Mark Miller added a comment -

          I think since a core or the overseer is not up yet, the nodes to watch are simply not there - a fix may be as simple as allowing the cloudsolrserver or zkstatereader to create the nodes it's expecting to watch if it doesn't find them.

          Grant Ingersoll added a comment -

          FWIW, my workaround is simply to catch the exception and to recreate a new CloudSolrServer and try again (with some backoff logic).
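
          A minimal sketch of this kind of workaround, assuming nothing about the reporter's actual code: each attempt builds a fresh client, and failures are retried with exponential backoff. RetryWithFreshClient, its Request interface, and all parameters are hypothetical names for illustration and are not part of SolrJ; a real version would also shut down each discarded client.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of SolrJ): runs a request against a freshly
 * created client and retries with exponential backoff when the request fails.
 */
final class RetryWithFreshClient {

    /** Stand-in for "one request executed against a client of type C". */
    interface Request<C, R> {
        R execute(C client) throws Exception;
    }

    static <C, R> R call(Supplier<C> clientFactory,
                         Request<C, R> request,
                         int maxAttempts,
                         long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Recreate the client on every attempt so that any stale view of
            // the cluster held by the previous instance is thrown away.
            C client = clientFactory.get();
            try {
                return request.execute(client);
            } catch (InterruptedException e) {
                // Do not retry after an interrupt; restore the flag and stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // exponential backoff between attempts
                }
            }
        }
        // Every attempt failed: surface the most recent failure.
        throw last;
    }
}
```

          With SolrJ, the factory would construct the cloud client and the request would issue the query; both are left abstract here because the thread does not show the original code.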

          Grant Ingersoll added a comment -

          Also, note, I'm still working to confirm what is happening with a standalone test, but it may take me a bit.

          Yonik Seeley added a comment -

          "a fix may be as simple as allowing the cloudsolrserver or zkstatereader to create the nodes it's expecting to watch if it doesn't find them."

          A retry seems much safer... after all, someone could have given the wrong path. Seems like clients should normally be read-only w.r.t. ZK.

          Mark Miller added a comment -

          Yeah, makes sense.

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Uwe Schindler added a comment -

          Move issue to Solr 4.9.

          Vitaliy Zhovtyuk added a comment -

          Added test to reproduce the issue

          Alan Woodward added a comment -

          Moving the conversation over from SOLR-7146...

          "Seems like clients should normally be read-only w.r.t. ZK."

          This isn't the case at the moment - ZkStateReader.createClusterStateWatchersAndUpdate already calls ensureExists() on the aliases and clusterstate.json znodes. I agree that it sounds like a good idea, though.

          Maybe a nicer way forward would be:

          • throw a 503 SolrServerException if any of the watcher nodes don't exist in ZkStateReader when createClusterStateWatchersAndUpdate is called
          • add a sugar waitForCluster(timeout) method to CloudSolrClient that will repeatedly check ZK for the relevant nodes

          Also, if we want to really ensure that clients never actually change ZK, we could add a ReadOnlyZkClient that subclasses SolrZkClient and throws UnsupportedOperationException on makePath() and setData(), and make CloudSolrClient use that.
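
          The read-only client idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed patch: SketchZkClient is an in-memory stand-in rather than the real SolrZkClient, and its method signatures are simplified guesses that do not match the real API.

```java
import java.util.Map;

/** In-memory stand-in for a ZooKeeper client; NOT the real SolrZkClient. */
class SketchZkClient {
    protected final Map<String, byte[]> nodes;

    SketchZkClient(Map<String, byte[]> nodes) {
        this.nodes = nodes;
    }

    boolean exists(String path) {
        return nodes.containsKey(path);
    }

    byte[] getData(String path) {
        return nodes.get(path);
    }

    /** Creates an empty node if the path is not present yet. */
    void makePath(String path) {
        nodes.putIfAbsent(path, new byte[0]);
    }

    void setData(String path, byte[] data) {
        nodes.put(path, data);
    }
}

/** Read-only variant: reads pass through, every mutating call is rejected. */
final class ReadOnlySketchZkClient extends SketchZkClient {

    ReadOnlySketchZkClient(Map<String, byte[]> nodes) {
        super(nodes);
    }

    @Override
    void makePath(String path) {
        throw new UnsupportedOperationException("read-only client: makePath(" + path + ")");
    }

    @Override
    void setData(String path, byte[] data) {
        throw new UnsupportedOperationException("read-only client: setData(" + path + ")");
    }
}
```

          A client built on the read-only variant can still observe nodes created by the servers, but a mistyped path fails loudly instead of silently creating new znodes.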

          Alan Woodward added a comment -

          Patch with test. ZkStateReader no longer tries to create any nodes, and will throw an exception in createClusterStateWatchersAndUpdate if the relevant nodes aren't there. Nodes are instead created in ZkController.

          Alan Woodward added a comment -

          Better patch, with some fixes for OverseerTest and ZkStateWriterTest. All tests pass.

          Shalin Shekhar Mangar added a comment -

          I am hesitant to introduce a method called waitForCluster. Can we overload connect with a boolean waitForCluster and timeout parameters instead?

          Alan Woodward added a comment -

          "Can we overload connect with a boolean waitForCluster and timeout parameters instead?"

          Yes, that's nicer. No need for the boolean as the presence of timeout parameters implies that you're ready to wait. Here's a patch.
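
          The shape of such an overload might look like the sketch below. It is not the attached patch: SketchCloudClient, the poll interval, and the readiness check (a stand-in for "the expected ZooKeeper nodes exist") are assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical client; clusterReady stands in for a check of the ZK nodes. */
final class SketchCloudClient {
    // Arbitrary poll interval chosen for this sketch.
    private static final long POLL_INTERVAL_MILLIS = 50;

    private final BooleanSupplier clusterReady;
    private boolean connected;

    SketchCloudClient(BooleanSupplier clusterReady) {
        this.clusterReady = clusterReady;
    }

    /** No timeout given: fail fast if the cluster is not ready yet. */
    void connect() {
        if (!clusterReady.getAsBoolean()) {
            throw new IllegalStateException("Cluster is not ready");
        }
        connected = true;
    }

    /** A timeout implies the caller is willing to wait, so no boolean flag. */
    void connect(long duration, TimeUnit unit)
            throws TimeoutException, InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(duration);
        while (true) {
            if (clusterReady.getAsBoolean()) {
                connected = true;
                return;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                throw new TimeoutException(
                    "Cluster not ready after " + duration + " " + unit);
            }
            // Sleep for the poll interval, or for the time left if shorter.
            long remainingMillis = TimeUnit.NANOSECONDS.toMillis(remainingNanos);
            Thread.sleep(Math.min(POLL_INTERVAL_MILLIS, Math.max(1, remainingMillis)));
        }
    }

    boolean isConnected() {
        return connected;
    }
}
```

          Keeping the no-argument connect() as the fail-fast path means existing callers see an immediate error, while callers that pass a timeout opt in to waiting.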

          Shalin Shekhar Mangar added a comment -

          +1

          Alan Woodward added a comment -

          Thanks everyone!

          ASF subversion and git services added a comment -

          Commit 1665174 from Alan Woodward in branch 'dev/trunk'
          [ https://svn.apache.org/r1665174 ]

          SOLR-4044: CloudSolrClient.connect() can take a timeout parameter to wait for the cluster

          ASF subversion and git services added a comment -

          Commit 1665175 from Alan Woodward in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1665175 ]

          SOLR-4044: CloudSolrClient.connect() can take a timeout parameter to wait for the cluster

          Timothy Potter added a comment -

          Bulk close after 5.1 release


            People

            • Assignee: Alan Woodward
            • Reporter: Grant Ingersoll
            • Votes: 1
            • Watchers: 11
