Solr / SOLR-4165

Queries blocked when stopping a node

    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: Trunk
    • Fix Version/s: 4.9, Trunk
    • Component/s: search, SolrCloud
    • Labels: None
    • Environment:

      5.0-SNAPSHOT 1366361:1420056M - markus - 2012-12-11 11:52:06

    Description

      Our 10-node test cluster (10 shards, 20 cores) briefly blocks incoming queries when a node is stopped gracefully, and blocks queries again for at least a few seconds when the node is started.

      We're using siege to send roughly 10 queries per second to a pair of load balancers. Those load balancers ping (admin/ping) each node every few hundred milliseconds. The ping queries continue to operate normally while requests to our main request handler are blocked. A manual request directly to a live Solr node is also blocked for the same duration.

      There are no errors logged, but it is clear that the entire cluster blocks queries as soon as the starting node is reading its config from ZooKeeper, likely even slightly earlier.

      The blocking time when stopping a node varies between 1 and 5 seconds. The blocking time when starting a node varies between 10 and 30 seconds. The blocked queries come rushing in again after a queue of ping requests is served. The ping request sets the main request handler via the qt parameter.
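
      A minimal sketch of this kind of health-check loop, using SolrJ's HttpSolrServer from this era; the URL and timing values are illustrative, not taken from the setup described here:

        import org.apache.solr.client.solrj.impl.HttpSolrServer;

        public class PingMonitor {
          public static void main(String[] args) throws Exception {
            // Point at any live node; admin/ping is routed through the main
            // request handler when the ping handler's qt parameter says so.
            HttpSolrServer solr = new HttpSolrServer("http://node1:8983/solr/collection1");
            while (true) {
              long start = System.currentTimeMillis();
              solr.ping(); // same admin/ping check the load balancers issue
              long elapsed = System.currentTimeMillis() - start;
              if (elapsed > 1000) {
                System.out.println("ping stalled for " + elapsed + " ms");
              }
              Thread.sleep(500); // every few hundred milliseconds, as described
            }
          }
        }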

      UPDATE:
      Since SOLR-3655, queries are no longer blocked when starting a node, only for a few seconds when stopping a node, using Solr 5.0.0.2013.02.15.13.26.04.

    Activity

          Markus Jelsma added a comment -

          Anyone here to test whether this issue applies to 4.x as well?

          Mark Miller added a comment -

          4x and 5x are pretty much in alignment these days.

          Are you still seeing this? Very strange...

          Markus Jelsma added a comment -

          Yes. Query time is consistent until a node starts. A few seconds after start-up, all other nodes stop responding for a significant period (10-30 seconds). When that time has passed, the nodes suddenly start sending responses again.

          Markus Jelsma added a comment - edited

          We're also seeing the restarted node as ACTIVE in the cluster state immediately after start-up, but its schema and index have not been loaded yet; only after everything is initialized does the state become RECOVERING. Is it possible it's ACTIVE too early, so the other nodes query it but do not receive a reply until it's fully initialized?
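
          One way to observe what is described here is to dump /clusterstate.json straight from ZooKeeper while a node restarts and watch the replica's "state" field; a sketch using the plain ZooKeeper client (host and port are illustrative):

            import org.apache.zookeeper.ZooKeeper;

            public class ClusterStateDump {
              public static void main(String[] args) throws Exception {
                // In Solr of this era, the collective cluster state lives in a
                // single /clusterstate.json znode.
                ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, event -> {});
                byte[] data = zk.getData("/clusterstate.json", false, null);
                System.out.println(new String(data, "UTF-8"));
                zk.close();
              }
            }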

          Mark Miller added a comment -

          Probably - good thought. Take a look at SOLR-3655 by the way.

          I'll try and think on this some...

          Markus Jelsma added a comment -

          SOLR-3655 sounds like what I describe. Seems I opened a duplicate. Thanks!

          Mark Miller added a comment -

          Hey Markus - how were you stopping the node? Standard stop or kill? A standard stop should pull the node out of live nodes pretty darn quickly...

          Markus Jelsma added a comment -

          Hi Mark, this is for standard stops. On shutdown the cluster can stall very briefly, a matter of 1 or 2 seconds at most in our case. On start up the problem is more serious.

          Steve Rowe added a comment -

          Mark, can this be resolved as a duplicate of SOLR-3655?

          Mark Miller added a comment -

          Yeah, resolving as a duplicate - I'll solve this in SOLR-3655.

          Markus Jelsma added a comment -

          Mark, I'm not sure this issue is entirely resolved. If I'm doing a stress test against a cluster and restart a node, the entire cluster still gets blocked. SOLR-3655 did improve things a lot; now the cluster only gets blocked when a node stops. When the node starts up again, the stress test continues without being interrupted.

          Mark Miller added a comment -

          I guess reopen and rename this is the best move.

          Mark Miller added a comment - edited

          These can't be blocked for long, right? We are publishing DOWN before shutting down. My guess is that doing what we do on startup (waiting to see the published state show up in our clusterstate) will deal with this, but it must be a pretty short time difference even now. I suppose I could have said the same thing about startup, though.

          Markus Jelsma added a comment -

          Correct. Start-up was up to 30 seconds and shutdown not more than a few seconds. Shutdown still stops the world for about two seconds, not a lot more, but you'll clearly notice it when a stream of HTTP requests suddenly freezes.

          Commit Tag Bot added a comment -

          [trunk commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1446914

          SOLR-4421,SOLR-4165: On CoreContainer shutdown, all SolrCores should publish their state as DOWN.
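
          Schematically, the change this commit message describes amounts to publishing DOWN for every core before anything is torn down; a rough sketch (names are illustrative, this is not the actual patch):

            import java.util.List;

            public class CoreContainerSketch {
              interface StatePublisher { void publish(String coreName, String state); }

              private final List<String> coreNames;
              private final StatePublisher publisher; // stand-in for the ZooKeeper publisher

              CoreContainerSketch(List<String> coreNames, StatePublisher publisher) {
                this.coreNames = coreNames;
                this.publisher = publisher;
              }

              public void shutdown() {
                // First tell ZooKeeper every core is DOWN, so other nodes stop
                // routing distributed queries here...
                for (String core : coreNames) {
                  publisher.publish(core, "down");
                }
                // ...then close the cores themselves.
              }
            }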

          Commit Tag Bot added a comment -

          [trunk commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1446926

          SOLR-4421,SOLR-4165: Fix wait loop to sleep, reduce max wait time, wait min 1 second
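
          The wait loop itself, per this commit message, sleeps between polls and is bounded on both ends; a schematic version (not the actual patch):

            import java.util.function.BooleanSupplier;

            public class WaitLoop {
              // Polls until 'done' holds, sleeping instead of spinning; waits at
              // least minWaitMs and gives up after maxWaitMs.
              static void waitFor(BooleanSupplier done, long minWaitMs, long maxWaitMs)
                  throws InterruptedException {
                long start = System.currentTimeMillis();
                while (true) {
                  long elapsed = System.currentTimeMillis() - start;
                  if (elapsed >= maxWaitMs) return; // reduced max wait time
                  if (elapsed >= minWaitMs && done.getAsBoolean()) return; // wait min 1 second
                  Thread.sleep(100); // fixed sleep between checks
                }
              }
            }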

          Commit Tag Bot added a comment -

          [branch_4x commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1446938

          SOLR-4421,SOLR-4165: On CoreContainer shutdown, all SolrCores should publish their state as DOWN.

          Mark Miller added a comment -

          Can you try out the latest when you get a chance Markus?

          Commit Tag Bot added a comment -

          [branch_4x commit] Mark Robert Miller
          http://svn.apache.org/viewvc?view=revision&revision=1446939

          SOLR-4421,SOLR-4165: Fix wait loop to sleep, reduce max wait time, wait min 1 second

          Markus Jelsma added a comment -

          Hi Mark, the problem persists, and it seems to block slightly longer on shutdown than the previous checkout (15th). It waits about 3-4 seconds now.

          Mark Miller added a comment -

          Bah, that's odd.

          Markus Jelsma added a comment -

          I've double-checked. All nodes run today's build. I did the test after all nodes got the upgrade. To make really sure, I did another test just now... and it still blocks.

          Mark Miller added a comment -

          I'm guessing we get our shutdown call in filter#destroy after Jetty has already stopped taking connections, or something along those lines.
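
          The ordering concern here is the standard servlet lifecycle: the container only calls Filter#destroy while shutting down, after the connector has stopped accepting connections, so state published to ZooKeeper from destroy() comes too late to stop other nodes routing queries here. A minimal illustration (not Solr's actual SolrDispatchFilter):

            import java.io.IOException;
            import javax.servlet.*;

            public class ShutdownOrderingFilter implements Filter {
              public void init(FilterConfig config) {
                // Called at startup, before the first request is served.
              }

              public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
                  throws IOException, ServletException {
                chain.doFilter(req, resp); // normal request dispatch
              }

              public void destroy() {
                // By the time the container invokes this, Jetty has already
                // stopped taking connections; publishing DOWN from here cannot
                // prevent the cluster from having routed queries to this node.
              }
            }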

          Mark Miller added a comment -

          I think the best we can do is offer an explicit API to first publish the node as down in ZooKeeper, and then you do a hard stop.
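
          Sketched as a two-step flow (publishNodeAsDown here is a hypothetical API, not something that existed at the time of this comment):

            public class GracefulStop {
              interface ClusterPublisher {
                void publishNodeAsDown(String nodeName); // hypothetical explicit API
              }

              static void stop(ClusterPublisher publisher, String nodeName, Runnable hardStop)
                  throws InterruptedException {
                publisher.publishNodeAsDown(nodeName); // 1: mark the node down in ZooKeeper
                Thread.sleep(1000);                    // give the cluster time to see it
                hardStop.run();                        // 2: then do the hard stop
              }
            }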

          Steve Rowe added a comment -

          Bulk move 4.4 issues to 4.5 and 5.0

          Uwe Schindler added a comment -

          Move issue to Solr 4.9.


    People

    • Assignee: Mark Miller
    • Reporter: Markus Jelsma
    • Votes: 0
    • Watchers: 4
