SOLR-7361

Main Jetty thread blocked by core loading, delaying the HTTP listener from binding when cores are slow to load

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.3, 6.0
    • Component/s: SolrCloud
    • Labels: None

      Description

      During server startup, the CoreContainer uses an ExecutorService to load cores in multiple background threads, but then blocks until all cores are loaded; see CoreContainer#load around line 290 on trunk (the invokeAll call). From the JavaDoc on that method, we have:

      Executes the given tasks, returning a list of Futures holding their status and results when all complete. Future.isDone() is true for each element of the returned list.

      In other words, this is a blocking call.

      This delays the Jetty HTTP listener from binding and accepting requests until all cores are loaded. Do we need to block the main thread?
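
      To make the blocking call concrete, here is a minimal, self-contained Java sketch of the two patterns (hypothetical stand-in code for illustration, not the actual CoreContainer implementation):

      import java.util.List;
      import java.util.concurrent.Callable;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;

      public class CoreLoadSketch {
        // Hypothetical stand-in for a core-loading task.
        static Callable<String> loadCore(String name) {
          return () -> {
            Thread.sleep(1000); // simulate a slow-loading core
            return name;
          };
        }

        public static void main(String[] args) throws Exception {
          ExecutorService pool = Executors.newFixedThreadPool(4);
          List<Callable<String>> tasks = List.of(loadCore("core1"), loadCore("core2"));

          // Current behavior: invokeAll() blocks the calling (main) thread
          // until every task completes, delaying the HTTP listener.
          List<Future<String>> results = pool.invokeAll(tasks);
          System.out.println(results.size() + " cores loaded before continuing");

          // Non-blocking alternative: submit() returns immediately, so the
          // main thread could continue and bind the listener while cores load.
          for (Callable<String> t : tasks) {
            pool.submit(t);
          }
          pool.shutdown();
        }
      }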

      Also, prior to this happening, the node is registered as a live node in ZK, which makes it a candidate for receiving requests from the Overseer, such as servicing a create-collection request. The problem, of course, is that the node listed in /live_nodes isn't accepting requests yet. So we either need to unblock the main thread during server loading, or wait longer before we register as a live node ... not sure which is the better way forward?

      Attachments

      1. SOLR-7361.patch
        33 kB
        Mark Miller
      2. SOLR-7361.patch
        32 kB
        Mark Miller
      3. SOLR-7361.patch
        32 kB
        Mark Miller
      4. SOLR-7361.patch
        15 kB
        Mark Miller
      5. SOLR-7361.patch
        14 kB
        Mark Miller
      6. SOLR-7361.patch
        14 kB
        Mark Miller
      7. SOLR-7361.patch
        9 kB
        Mark Miller
      8. SOLR-7361.patch
        8 kB
        Timothy Potter


          Activity

          Ramkumar Aiyengar added a comment -

           Isn't it a problem that live_nodes is set up but the cores aren't, only when the core isn't marked down? I see currently that ZkController.preRegister happens async in the threadpool; if that happened synchronously, we should be free to block the main Jetty thread or free it up as we deem fit. At that point, might as well free it up, but I admit I am not an expert on whether there are other interactions we need to worry about..

          Timothy Potter added a comment -

          Isn't it a problem that live_nodes is set up but the cores aren't, only when the core isn't marked down?

           You're right about requests being sent to cores, but I think the problem here is admin-type requests, such as those coming from the overseer to create a new collection. This is actually how we found this in a big cluster after a restart: some cores were slow to load (due to the suggester dictionary issue) and create-collection requests started to fail.

          In general, I don't think all cores should be blocked from being accessed until the slowest core is loaded, so I'm thinking we need to re-think how cores are loaded in the background in cloud mode.

          Jessica Cheng Mallet added a comment -

          I think I have also seen cases where if we bounced two nodes holding two replicas of a particular collection/shard, then they both can't complete their recovery because they can't talk to each other. This fixes itself eventually when they time out waiting for each other, but before that happens they're basically "deadlocked". (Unfortunately I don't have logs to back that up anymore, so it's more of an anecdotal account.)

          Ramkumar Aiyengar added a comment -

           Tim, I understand. What I meant is that we can't currently let through any requests, because you are not guaranteed that cores will be in a sensible state for accepting requests before the core-loading threads finish. That's why at least the preRegister for all cores needs to finish (and that shouldn't take long) before we start accepting requests. Once that happens, I think cores can load up in the background (or at least we can try, stumble, and fix).

          Timothy Potter added a comment -

           Ramkumar Aiyengar, ah ok, I misunderstood; what you're saying makes sense. I'll try to tackle this on Thursday or Friday this week as I'm focused on a couple of other issues first, so if someone else can pick this up sooner, that would be great.

          Mark Miller added a comment -

          what you're saying makes sense.

          +1

          Ramkumar Aiyengar added a comment -

          preRegister for all cores needs to finish (that shouldn't take long) before we start accepting requests.

          This should become even faster btw once we get to SOLR-7281.

          Jessica: I have a feeling what you are saying might happen because of the same reason as above – if two nodes go down uncleanly as active, they would come back as live and active before the async core loading actually marks them as down.

           This is the sequence I propose (again, without looking at the code too deeply, so I might have missed something); a rough code sketch follows the list:

           • Mark all cores down; also make sure direct requests from outside Solr for all cores suitably fail.
           • Add to live nodes.
           • Start accepting requests.
           • Load cores in the background and enable requests for cores as they finish loading.
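
           A minimal runnable sketch of that ordering, as referenced above; every method here is a hypothetical placeholder for illustration, not the actual Solr API:

           import java.util.List;
           import java.util.concurrent.ExecutorService;
           import java.util.concurrent.Executors;

           // Hypothetical placeholders illustrating the proposed startup order.
           public class StartupSequenceSketch {
             static void publishAllCoresDown(List<String> cores) { System.out.println("marked down: " + cores); }
             static void registerLiveNode()  { System.out.println("added to /live_nodes"); }
             static void startHttpListener() { System.out.println("Jetty listener bound"); }
             static void loadCore(String core) { System.out.println("loaded " + core); }
             static void markCoreActive(String core) { System.out.println(core + " active"); }

             public static void main(String[] args) {
               List<String> cores = List.of("core1", "core2");
               ExecutorService pool = Executors.newFixedThreadPool(2);

               publishAllCoresDown(cores); // 1. mark all cores down first
               registerLiveNode();         // 2. only then join /live_nodes
               startHttpListener();        // 3. start accepting requests
               for (String core : cores) { // 4. load in background; enable each as it finishes
                 pool.submit(() -> { loadCore(core); markCoreActive(core); });
               }
               pool.shutdown();
             }
           }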
          Timothy Potter added a comment -

          heh - now tons of tests fail because the main thread isn't blocked and finishes before cores are initialized ... knee-jerk reaction would be to activate this true async loading method using a System property, e.g. solr.cloud.loadCoresAsync, which defaults to true but is set to false in most tests ... other ideas?

          Shawn Heisey added a comment -

           I like the idea of having the admin UI responsive immediately on startup ... if nothing else, to be able to see the progress of the startup.

          Obviously any requests to cores that are not yet started must fail, but cores that have started, as well as any functionality that doesn't depend on core startup, should work.

          There's probably more to think about and tweak. During startup, CoreAdmin and CollectionsAdmin requests will require careful vetting. My thought here is that about the only operation that should be allowed on things that haven't yet started is DELETE, if we can find a way to do so safely. If deleting a core can remove it from the startup queue entirely, it's a slight optimization.

          If we aren't already doing so, this will require that we enumerate all the cores and/or the clusterstate before listening to any kind of request on CoreAdmin/CollectionsAdmin.

          A disabling property for tests is a good interim step, but I think that the long term goal should be to modify both the async startup and the tests so everything works right.

          Timothy Potter added a comment -

          Actually, if we go with the idea that non-cloud mode continues to block the main thread until cores are loaded, then all tests pass. I think that makes sense anyway, since there isn't anything like replica state in non-cloud mode to determine if a core is active, so the listener not being up until cores are loaded seems like a good thing for non-cloud mode. For cloud mode, it appears that we don't need a flag for the tests to pass.

          Shawn Heisey added a comment -

          If we can pull the same thing off in non-cloud mode, I would like that ... but cloud mode is where it's a big problem.

          Damien Kamerman added a comment -

           Regarding loading the cores in the background, I've made a few tweaks to work with thousands of cores (see SOLR-7191):
           1. Load cores in a fixed threadPool. Cores are registered in background threads, so there's no reason to load all cores at once!
           2. Register cores in corename order with a fixed 128-thread pool. This is so as not to flood the DistributedQueue.
          3. Publish an entire node as 'down' (4.10 branch)
          4. Cache ConfigSetService.createIndexSchema() in cloud mode.

          So, my questions are:
          What threadPool size will be used to load the cores?
          What order will cores be loaded in?
          Will cores be registered as they are loaded, or offloaded to another threadPool?

          My initial thought was to register as a live node after cores are loaded.

          Ramkumar Aiyengar added a comment -

           Maybe provide a method in CoreContainer to wait till cores are initialized?

          Timothy Potter added a comment -

           Damien Kamerman, all good questions, but this ticket is not intended to address any of those, and it sounds like you're tackling them as part of SOLR-7191. I'm close to posting a patch for this, which will only address the problem of blocking the main thread (which activates the Jetty listener) during core loading.

          Timothy Potter added a comment -

          Patch that blocks the main thread until cores are pre-registered with ZK, but then loads cores in the background, allowing the main thread to progress and activate the Jetty listener. Only works in SolrCloud mode; the main thread will continue to be blocked by core loading in non-cloud mode. Cores will come online asynchronously when they are loaded.

          Ramkumar Aiyengar added a comment -

          Looks like a good start..

          • What does a client get back when sending a request to a core which is not yet loaded? It should be a sensible 5xx error, but would be good to test that..
           • Do we still need the sysprop for tests? As I mention above, we could use a CoreContainer call to synchronize ourselves in tests; a rough sketch of such a call follows..
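
           A minimal sketch of the kind of CoreContainer call suggested here, assuming a latch counted down by each background loader (all names are hypothetical, not Solr's actual API):

           import java.util.concurrent.CountDownLatch;
           import java.util.concurrent.TimeUnit;

           // Hypothetical sketch; not the real CoreContainer.
           public class LoadingCoresSketch {
             private final CountDownLatch coresLoaded;

             public LoadingCoresSketch(int coreCount) {
               this.coresLoaded = new CountDownLatch(coreCount);
             }

             // Each background loader calls this when its core finishes loading.
             void onCoreLoaded() {
               coresLoaded.countDown();
             }

             // Tests call this to synchronize, instead of disabling async
             // loading via a sysprop; returns false on timeout.
             public boolean waitForLoadingCoresToFinish(long timeoutMs) throws InterruptedException {
               return coresLoaded.await(timeoutMs, TimeUnit.MILLISECONDS);
             }
           }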
          Mark Miller added a comment -

          solr.cloud.loadCoresAsync, which defaults to true but is set to false in most tests ... other ideas?

          That doesn't seem right - we end up testing the wrong stuff mostly then - I'm working on a similar problem - let me see if I can get something better.

          Mark Miller added a comment -

          What does a client get back when sending a request to a core which is not yet loaded? It should be a sensible 5xx error, but would be good to test that..

          If it doesn't block, it will be a behavior back compat break we have to think about. I don't want to make this an option, so just simulating the block may be one way to go.

          Mark Miller added a comment -

           I breezed over this a while back and it didn't stick with me. I ran into a situation I needed to solve myself and started tackling this the way I would approach it, in SOLR-7416. Here is the patch I hacked out. I would actually still like to wait for quick-loading cores in most cases for up to N seconds and just have slow-loading stragglers come in later. I'd also like to avoid any switch around the behaviour if possible. We have enough complication and varying code paths in this area.

          Mark Miller added a comment -

          Still a beta, but here is patch v2.

          Mark Miller added a comment -

          Here is a good version for comment with passing tests.

          Note: I had to fix an issue with how UpdateLog gets closed. Been trying to track that one down for a while I think!

          Timothy Potter added a comment -

          Thanks for picking this one up. I like this approach better as well, since it avoids the cloud vs. non-cloud switch. And sorry about leaving that flag in there on my last patch as that caused some confusion for you and Ram (it wasn't being used anymore). I'll kick the tires on this patch a bit myself but +1 to the approach.

          Timothy Potter added a comment -

          Assigning to Mark as he's proposed a better solution for this moving forward.

          Mark Miller added a comment -

          Here is another iteration that polishes up some areas.

          Mark Miller added a comment -

          This does change behavior a bit of course - all of your cores were loaded when you came out of CC#load before. I think it does a good enough job of simulating the old behavior though. Attempting to access a loading core specifically will block until it's ready to serve. I think given that this does seem like a big enough limitation that it enters bug territory, it's well worth making the change.

          Ramkumar Aiyengar added a comment -

          Sorry for the late feedback.

          My concern with this approach is that this can be trappy for client code. When you are developing/testing, you might always have the cores up and running by the time you issue requests – and then the edge case will be exposed one day on production when slow disk IO or a bad GC pause strikes..

          If we are breaking compat anyway, why not always break it (instead of in some cases outside developer control – which imo is no better) and return immediately without sleeping? In any case, at the very minimum, we would want to do that in our tests so that we account for that situation..

          Mark Miller added a comment -

          you might always have the cores up and running by the time you issue requests

          Other requests will block until the core is ready - what will that break? That is how it works now if requests come in before the cores are loaded. Do you have a concrete example?

          If we are breaking compat anyway, why not always break it

          If the cores are just going to load quickly, I think it's simply nicer to wait for them to load and only have outliers straggle in later, as the system did work. Unless there are outliers, I think it's a nicer system experience. This is a simple simulation of that approach.

          It doesn't matter how long that wait is - you can set it to 0 and all the tests will pass fine. You can certainly randomly set that to 0 in test runs.

          Ramkumar Aiyengar added a comment -

          I haven't synthetically tried this to confirm, happy to do that, I am just away from keyboard this instant. But here's my concern..

          Before: I would have got a 5xx error for unloaded cores since all load at once before Jetty binds, which every client should consider normal in a distributed setup at least and try alternatives.

           Now: I would block till the loading is done. This could trigger timeouts, which usually is not normal (it would usually indicate slow queries), or worse, backlog requests in a busy server if it's using a threadpool and sync HTTP requests – even if timeouts are not hit.

          Again, my understanding might be off, let me confirm this soon, and I can update if I am spewing nonsense..

          Mark Miller added a comment -

          Before: I would have got a 5xx error for unloaded cores since all load at once before Jetty binds

           Would you? I thought the request just blocked until the SolrDispatchFilter was ready ... we should confirm it's a 5xx error - if that is the case, that is what we should be emulating.

          Mark Miller added a comment -

          Okay, with SolrCloud, if you hit a core that is loaded and one is not, you get:

           <?xml version="1.0" encoding="UTF-8"?>
           <response>
             <lst name="responseHeader">
               <int name="status">503</int>
               <int name="QTime">3</int>
               <lst name="params"/>
             </lst>
             <lst name="error">
               <str name="msg">no servers hosting shard: </str>
               <int name="code">503</int>
             </lst>
           </response>

          If you hit the core that is loading you get:

          curl: (7) Failed to connect to localhost port 8902: Connection refused

          I'll work up a new patch.

          Mark Miller added a comment -

          Made some good progress this afternoon. I'll post a patch relatively soon.

          Mark Miller added a comment -

          Here is my current progress.

          By default CoreContainer still waits on load, but has a new async load option.

          JettySolrServer still waits on load, but has a new async load option.

          SolrDispatchFilter turns on the async load option and also returns a 503 on request while a core is loading (though it seems perhaps you can get a 510 instead depending on timing due to the stateformat=2 stuff).

          I think that is back compat and gives us the new behavior we want.
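
           A rough sketch of the 503-on-loading behavior described above, assuming a registry of still-loading cores (the class and method names are hypothetical stand-ins, not the real SolrDispatchFilter code):

           import java.io.IOException;
           import java.util.Set;
           import java.util.concurrent.ConcurrentHashMap;
           import javax.servlet.http.HttpServletResponse;

           // Hypothetical stand-in for the dispatch-filter check; not actual Solr code.
           public class LoadingCoreGuard {
             private final Set<String> loadingCores = ConcurrentHashMap.newKeySet();

             void markLoading(String coreName) { loadingCores.add(coreName); }
             void markLoaded(String coreName)  { loadingCores.remove(coreName); }

             // Returns true if the request was rejected because the core is still loading.
             boolean rejectIfLoading(String coreName, HttpServletResponse rsp) throws IOException {
               if (loadingCores.contains(coreName)) {
                 // 503 lets clients retry or try another replica, approximating the
                 // pre-patch behavior where the listener was simply not bound yet.
                 rsp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE, "SolrCore is loading");
                 return true;
               }
               return false;
             }
           }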

          Mark Miller added a comment -

          Any comments on the latest patch?

          Ramkumar Aiyengar added a comment -

           Sorry, I started looking at this over the weekend before getting distracted. +1 overall.

           • Does your test in TestMiniCloud assume that by the time you assert, the async loading wouldn't have finished? Seems like we are relying on a timing detail. Maybe we could at least check whether the core is loaded at that point (still a race since it won't be an atomic check, but probably less likely to hit).
           • Minor: Would be nice for JettyConfig to instead take an Integer waitForLoadingMs (can be null) so that it can be tweaked, instead of a boolean. It's hard anyway to assume a default for this.
          Timothy Potter added a comment -

          +1 LGTM ... although the recent dispatch filter refactoring has broken the latest patch on trunk

          Mark Miller added a comment -

          Regaining a bit of use of my right hand for hunt and peck - I'll see if I can update this on Sunday.

          Mark Miller added a comment -

           Maybe we could at least check whether the core is loaded at that point

          Just another timing detail with false fails?

          Mark Miller added a comment -

          Updated to trunk.

          Mark Miller added a comment -

           I have to work out why TestMiniSolrCloudCluster#testBasics is failing due to 510, stale cluster state responses. Something is off in the merge-up, or some change I pulled in altered behavior.

          Mark Miller added a comment -

          Just another timing detail with false fails?

          I think I was misunderstanding. You are saying only do the check if the core is not loaded by then? That almost seems like we are just adding nothing to that test then - it wouldn't fail without the change like the current one does. I think if we start seeing false fails in the jenkins cluster, we should do something more invasive.

          Mark Miller added a comment -

          Cleans up the remaining issues except the quality of the test.

          Ramkumar Aiyengar added a comment -

           It would still check if you correctly return a 5xx when you are not loaded, but I see your point. In order to properly test the async nature, you would need to pass down a latch which prevents completion till you release it from the test, aka the invasive bit you refer to. Okay, +1 from me then..

          ASF subversion and git services added a comment -

          Commit 1682060 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1682060 ]

          SOLR-7361: Slow loading SolrCores should not hold up all other SolrCores that have finished loading from serving requests.

          ASF subversion and git services added a comment -

          Commit 1682065 from Mark Miller in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1682065 ]

          SOLR-7361: Slow loading SolrCores should not hold up all other SolrCores that have finished loading from serving requests.

          Mark Miller added a comment -

          Thanks all! This is a great improvement.

          Shalin Shekhar Mangar added a comment -

          Bulk close for 5.3.0 release


            People

             • Assignee: Mark Miller
             • Reporter: Timothy Potter
             • Votes: 0
             • Watchers: 12
