SOLR-5986

Don't allow runaway queries from harming Solr cluster health or search performance

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0
    • Component/s: search
    • Labels:
      None

      Description

      The intent of this ticket is to have all distributed search requests stop wasting CPU cycles on requests that have already timed out or are so complicated that they won't be able to execute. We have come across a case where a nasty wildcard query within a proximity clause caused the cluster to enumerate terms for hours, even though the query timeout was set to minutes. This caused a noticeable slowdown within the system and forced us to restart the replicas that happened to service that one request. In the worst case, users with a relatively low ZooKeeper timeout value will have nodes start dropping from the cluster due to long GC pauses.

      Aaron McCurry built a mechanism into Apache Blur to help with this issue in BLUR-142 (see the commit comment for code, though look at the latest code on trunk for newer bug fixes).

      Solr should either prevent these problematic queries from running via some heuristic (possibly the estimated size of heap usage) or be able to interrupt all query threads once the time threshold is met. This issue mirrors what others have discussed on the mailing list: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3C856ac15f0903272054q2dbdbd19kea3c5ba9e105b9d8@mail.gmail.com%3E
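
      For reference, a per-request time limit is normally expressed through the standard timeAllowed parameter (in milliseconds). Below is a minimal, hypothetical SolrJ sketch (host, collection, and query are made up); at the time this issue was filed, timeAllowed only bounded the document-collection phase, which is exactly the gap described above.

          import org.apache.solr.client.solrj.SolrQuery;
          import org.apache.solr.client.solrj.impl.HttpSolrServer;
          import org.apache.solr.client.solrj.response.QueryResponse;
          import org.apache.solr.common.params.CommonParams;
          import org.apache.solr.common.params.ShardParams;

          // Hypothetical example only: host, collection, and query are made up.
          public class TimeAllowedExample {
            public static void main(String[] args) throws Exception {
              HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
              try {
                SolrQuery query = new SolrQuery("body:ab* AND title:xy*"); // a potentially expensive query
                query.set(CommonParams.TIME_ALLOWED, 60000);   // give up after 60 seconds
                query.set(ShardParams.SHARDS_TOLERANT, true);  // accept results from the shards that did respond
                QueryResponse rsp = server.query(query);
                // If the limit kicked in, the response header carries a partialResults flag.
                System.out.println("partialResults: " + rsp.getResponseHeader().get("partialResults"));
              } finally {
                server.shutdown();
              }
            }
          }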

      1. SOLR-5986.patch
        58 kB
        Anshum Gupta
      2. SOLR-5986.patch
        59 kB
        Steve Rowe
      3. SOLR-5986.patch
        47 kB
        Anshum Gupta
      4. SOLR-5986.patch
        47 kB
        Anshum Gupta
      5. SOLR-5986.patch
        36 kB
        Anshum Gupta
      6. SOLR-5986.patch
        40 kB
        Anshum Gupta
      7. SOLR-5986.patch
        40 kB
        Anshum Gupta
      8. SOLR-5986.patch
        39 kB
        Anshum Gupta
      9. SOLR-5986.patch
        20 kB
        Anshum Gupta
      10. SOLR-5986.patch
        23 kB
        Anshum Gupta
      11. SOLR-5986.patch
        24 kB
        Anshum Gupta
      12. SOLR-5986.patch
        22 kB
        Anshum Gupta
      13. SOLR-5986-fixtests.patch
        13 kB
        Anshum Gupta
      14. SOLR-5986-fixtests.patch
        2 kB
        Anshum Gupta
      15. SOLR-5986-fixtests.patch
        3 kB
        Anshum Gupta

        Issue Links

          Activity

          Steve Davids added a comment -

          As a follow-up, we are still hitting this issue, and it is becoming more and more frequent. Upon further research it looks like this is a fairly common problem that afflicts various Lucene community members. As noted in the description, Apache Blur has implemented a coping mechanism, and more recently Elasticsearch has implemented its own solution, which performs an up-front query heap estimation and trips a "circuit breaker" if the estimate exceeds a threshold, thus not allowing the query to crash the cluster.

          Documentation: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html#fielddata-circuit-breaker
          Ticket: https://github.com/elasticsearch/elasticsearch/issues/2929 & https://github.com/elasticsearch/elasticsearch/pull/4261

          If anyone has suggestions on how we can limp by for the time being, that would also be greatly appreciated (unfortunately our user base needs to keep using nested proximity wildcards, but we are willing to put mechanisms in place to kill the subset of problematic queries).

          Anshum Gupta added a comment -

          I was trying to look at this. As far as I can see, the two linked issues here are two completely different takes on handling compute-intensive/resource-hogging queries. The BLUR issue talks about killing the queries, whereas the ES issue tries to do some heap estimation up-front.

          To be clear, we're only looking at this from the perspective of maintaining cluster health and are not worried about partial results. Correct me if I'm wrong on that.

          Steve Davids added a comment -

          In an ideal world it would attempt to provide results for the shards that may be okay, but the end goal is to maintain the health of the cluster against queries that get out of hand. If you can know up front that there is no possible way a query could complete, then it would be reasonable to error out immediately (though the metric for knowing whether it will or won't complete may be squishy). Hopefully that makes sense...

          Jim Walker added a comment -

          Steve, I wonder why you would have to restart the replica? I presume this is because that is your only recourse to stop a query that might take days to complete?

          If a query takes that long and is ignoring a specified timeout, that seems like its own issue that needs resolution.

          IMHO, the primary goal should be to make SolrCloud clusters more resilient to performance degradations caused by such nasty queries described above.

          The circuit-breaker approach in the linked ES tickets is clever, but it does not seem to be as generally applicable as the ability to view all running queries with an option to stop them. For example, it seems the linked ES circuit breaker will only trigger for issues deriving from loading too much field data. The problem described above may result from this cause, or any number of other causes.

          My preference would be to have a response mechanism that 1) applies broadly and 2) a dev-ops guy can execute in a UI like Solr Admin, or even by API.

          Steve Davids added a comment -

          I wonder why you would have to restart the replica? I presume this is because that is your only recourse to stop a query that might take days to complete?

          Yes, that is correct, that is the easiest way to kill a run-away thread.

          If a query takes that long and is ignoring a specified timeout, that seems like it's own issue that needs resolution.

          The Solr instance that is distributing the requests to other shards honors the timeout value and stops the collection process once the threshold is met (and returns to the client with partial results if any are available), though the queries remain running on all of the shards that were initially searched in the overall distributed request. If the timeout value is honored on each shard that was used in the distributed request that would probably take care of the problem.

          IMHO, the primary goal should be to make SolrCloud clusters more resilient to performance degradations caused by such nasty queries described above.

          +1 resiliency to performance degradations is always a good thing

          The circuit-breaker approach in the linked ES tickets is clever, but it does not seem to be as generally applicable as the ability to view all running queries with an option to stop them.

          +1 I actually prefer the BLUR route, though being able to see the current queries plus the ability to kill them off across the cluster would be great. It is crucial, though, to be able to automatically kill queries after a certain threshold (ideally the timeout value), because I don't want to be monitoring the Solr admin page at all hours of the day (I could create scripts to do the work if an API call is available, but that's not preferred).

          My preference would be to have a response mechanism that 1) applies broadly and 2) a dev-ops guy can execute in a UI like Solr Admin, or even by API.

          +1 if "applied broadly" means the ability to specify a threshold at which to start killing off queries.

          Jim Walker added a comment -

          Steve, good point regarding automation. That should come first.

          I just talked to Anshum who has this covered; he will know what needs to happen here best. Cheers

          Anshum Gupta added a comment -

          I'm trying to solve this problem a little differently. Just for the curious (before there's a patch), I intend to have a tracking thread that looks up a list of requests at the core level and interrupts the requests that have timed out. I'm hoping that Lucene would play well with it and respect the interrupt too, solving the problem of a query running for hours/days.
          That would keep this change at the Solr level without being really intrusive or requiring changes to all the TermsEnum implementations out there.
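
          A rough, hypothetical sketch of that tracking-thread idea (none of these class or method names come from Solr or from the eventual patch; it only illustrates the register/scan/interrupt mechanism):

              import java.util.Map;
              import java.util.concurrent.ConcurrentHashMap;
              import java.util.concurrent.Executors;
              import java.util.concurrent.ScheduledExecutorService;
              import java.util.concurrent.TimeUnit;

              // Hypothetical names throughout; this only illustrates a core-level
              // watchdog that interrupts request threads once their deadline has passed.
              class RequestWatchdog {

                private static final class Tracked {
                  final Thread thread;
                  final long deadlineNanos;
                  Tracked(Thread thread, long deadlineNanos) {
                    this.thread = thread;
                    this.deadlineNanos = deadlineNanos;
                  }
                }

                private final Map<Long, Tracked> inFlight = new ConcurrentHashMap<>();
                private final ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();

                RequestWatchdog() {
                  // Periodically interrupt any request thread that has exceeded its deadline.
                  scheduler.scheduleAtFixedRate(() -> {
                    long now = System.nanoTime();
                    for (Tracked t : inFlight.values()) {
                      if (now - t.deadlineNanos > 0) {
                        t.thread.interrupt();
                      }
                    }
                  }, 1, 1, TimeUnit.SECONDS);
                }

                // Called at the start of request processing, on the request's own thread.
                void register(long requestId, long timeAllowedMillis) {
                  inFlight.put(requestId, new Tracked(Thread.currentThread(),
                      System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeAllowedMillis)));
                }

                // Called in a finally block once the request completes or fails.
                void unregister(long requestId) {
                  inFlight.remove(requestId);
                }
              }

          As discussed in the next comment, an interrupt only helps if the code that is busy (e.g. term enumeration) actually checks the interrupt flag.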

          Steve Davids added a comment -

          There doesn't appear to be any Lucene code that specifically honors a thread interrupt, so if Solr/Lucene is busy enumerating terms in a continual for loop, sending an interrupt won't actually do anything. The Java code needs to check whether the thread has been interrupted and, if so, bail out of the current process.

          Blur does this by creating its own "ExitableTerms", "ExitableTermsEnum", etc., where every time the enum's next method is called it checks whether the thread has been interrupted; if it has, an exception is thrown, which halts processing of the query. https://git-wip-us.apache.org/repos/asf?p=incubator-blur.git;a=blob;f=blur-store/src/main/java/org/apache/blur/index/ExitableReader.java;h=8321dd27d3537ee239f876448e56e8296407700b;hb=61480125dee51c469a4921004f6daf590410bca6

          Performing the thread interrupt check within Lucene seems reasonable for things that may take a long time to complete, enumerating terms is one of them.
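
          A minimal sketch of that wrap-and-check pattern, assuming a Lucene version that has FilterLeafReader.FilterTermsEnum (this is not Blur's or Solr's actual code, and the exception type is a placeholder):

              import java.io.IOException;

              import org.apache.lucene.index.FilterLeafReader;
              import org.apache.lucene.index.TermsEnum;
              import org.apache.lucene.util.BytesRef;

              // Sketch only: a TermsEnum wrapper that aborts term enumeration when the
              // current thread has been interrupted.
              class InterruptibleTermsEnum extends FilterLeafReader.FilterTermsEnum {

                InterruptibleTermsEnum(TermsEnum in) {
                  super(in);
                }

                @Override
                public BytesRef next() throws IOException {
                  // Long wildcard/fuzzy rewrites call next() in a tight loop, so this is the
                  // natural place to notice an interrupt (or a timeout) and bail out.
                  if (Thread.interrupted()) {
                    throw new RuntimeException("Interrupted while iterating over terms.");
                  }
                  return in.next();
                }
              }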

          Anshum Gupta added a comment -

          You're right Steve. Also, I tried something that I hoped might solve this, but it didn't.
          I'm back on the original plan of throwing an exception and breaking out during query expansion directly. Most probably, this JIRA will need to be moved to being a LUCENE issue for better tracking (will do that once I put up something).

          Things have changed and the same solution wouldn't really work anymore; I think even BLUR has moved away from that implementation since. I'm trying to work out something along similar lines.

          Anshum Gupta added a comment -

          Working on adding more tests and on deciding where the ExitableReader actually belongs.

          Steve Davids added a comment -

          We came across the issue again and added a lot more probes to get a grasp on what exactly is happening. I believe further tickets might be necessary to address the various pieces.

          #1) We are setting the "timeout" request parameter, which tells the TimeLimitingCollector to throw a TimeExceededException, yet in our logs we see the error messages thrown after about an hour for one of the queries we tried, even though the timeout is set to a couple of minutes. This is presumably because the query parsing takes about an hour, and once the query is finally parsed and handed to the collector, the TimeLimitingCollector immediately throws an exception. We should have something similar throw the same exception during the query building phase (this way the partial results warnings will continue to just work). It looks like the current work is more in the realm of solving this issue, which may fix the problems described in #2.

          #2) We set socket read timeouts on HttpClient, which causes the same query to be sent into the cluster multiple times, giving it a slow, painful death. This is even more problematic while using the SolrJ API: what ends up happening in SolrJ's LBHttpSolrServer is that it loops through every host in the cluster, and if a socket read timeout happens it tries the next item in the list. Internally, every single request made to the cluster from an outside SolrJ client will try to gather the results for all shards in the cluster; once a socket read timeout happens internal to the cluster, the same retry logic will attempt to gather results from the next replica in the list. So, if we hypothetically had 10 shards with 3 replicas and made a request from an outside client, it would make 30 (external SolrJ calls, one to each host, to request a distributed search) * 30 (each host will be called at least once for the internal distributed request) = 900 overall requests (each individual search host will handle 30 requests). This should probably become its own ticket, to either a) not retry on a socket read timeout or b) specify a retry timeout of some sort in the LBHttpSolrServer (this is something we did internally for simplicity's sake).

          Jim Walker added a comment -

          Steve, your last post does not incorporate the patch from Anshum, does it?

          Steve Davids added a comment -

          Correct, I was just providing additional insight into the issues we have been seeing.

          Anshum Gupta added a comment -

          New patch with the same approach but a few things fixed.

          This uses timeAllowed (also used by the TimeLimitingCollector) to time out queries during rewrite. Here's the fine print, though: the rewrite check and the TimeLimitingCollector each manage their own counter for the clock, i.e. the maximum time that a query could run for is (2 * timeAllowed + Delta), where Delta is the time spent on anything other than query expansion and collection.

          I think this should be OK, especially considering the intention is to make sure that the request is killed and doesn't run forever.
          There's always room for improvement though and we might want to share the same counter/offset/timeout so it's more predictable.

          P.S: I'm still running the pre-commit to check if I'm missing something and would then run the solr+lucene tests again.

          Anshum Gupta added a comment - - edited

          Updated patch with more things fixed and optimized. I still need to add more tests (working on them).

          • Removed unwanted Overrides
          • Changed and fixed class names.
          • Initialization of ThreadLocal variable to default value instead of a null check to make things easier to understand.
          • Setting the log message for the ExitingReaderException().
          • Removed unwanted null check in the ExitObject.reset() method.
          Steve Davids added a comment -

          I think this should be ok, specially considering the intention is to make sure that the request is killed and doesn't run forever.

          +1, this is a good starting point and can be further refined in the future if need be.

          I went ahead and opened SOLR-6496 to account for the LBHttpSolrServer's continual retries. Also, I am a little concerned that cursorMark doesn't honor the timeAllowed request parameter for some strange reason (the cursorMark ticket didn't provide any rationale for it); we may want to revisit that decision in yet another ticket so people can be confident their cursor mark queries won't crash their clusters either.

          Thanks for taking this on Anshum!

          Anshum Gupta added a comment -

          I haven't looked at all of the implementation for the cursor mark but if it's query rewriting/expansion that takes time, this patch should fix the issue. I'll open another issue after I commit this one to use a single timeoutAt value. Ideally, it should be a single exitObject for a request that gets used by everything that needs to limit the processing time.

          Anshum Gupta added a comment -

          New patch with a little different approach.

          • Renamed the ExitObject to QueryTimeout as it happens to actually be just a QueryTimeout.
          • Instead of setting/resetting it inside the SolrIndexSearcher methods, I'm now setting and resetting the QueryTimeout at the handler level (see the sketch after this list). I've made the MLTHandler and the SearchHandler work with this, as I can't think of any other handler that would be affected by it.
          • As we'd now be setting the timeOut at a more global level, we can work towards (in another JIRA) using this value for different components and different stages e.g. TimeLimitingCollector etc.
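
          A rough illustration of that handler-level set/reset (all names here are placeholders, since the real classes were renamed several times over the course of this issue; it only shows the arm-the-ThreadLocal / always-reset pattern):

              import org.apache.solr.common.params.CommonParams;
              import org.apache.solr.request.SolrQueryRequest;

              // Placeholder names; RequestTimeout stands in for the ThreadLocal holder
              // discussed in this issue and is not the patch's actual class.
              class TimeoutAwareHandlerSketch {

                static final class RequestTimeout {
                  private static final ThreadLocal<Long> TIMEOUT_AT_MILLIS = new ThreadLocal<>();

                  static void set(long timeAllowedMillis) {
                    TIMEOUT_AT_MILLIS.set(System.currentTimeMillis() + timeAllowedMillis);
                  }

                  static void reset() {
                    TIMEOUT_AT_MILLIS.remove();
                  }

                  static boolean shouldExit() {
                    Long timeoutAt = TIMEOUT_AT_MILLIS.get();
                    return timeoutAt != null && System.currentTimeMillis() > timeoutAt;
                  }
                }

                void handle(SolrQueryRequest req) {
                  String timeAllowed = req.getParams().get(CommonParams.TIME_ALLOWED);
                  if (timeAllowed != null) {
                    RequestTimeout.set(Long.parseLong(timeAllowed)); // arm before any search work happens
                  }
                  try {
                    // ... delegate to the usual search components (query expansion, collection, ...) ...
                  } finally {
                    RequestTimeout.reset(); // always disarm, even if the request failed or timed out
                  }
                }
              }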
          Rich Cariens added a comment - - edited

          How does the group feel about adding interruption hooks into the ExitableTermsEnum.checkAndThrow() method? Something like:

              private void checkAndThrow() {
                if (QueryTimeout.shouldExit()) {
                  throw new ExitingReaderException("The request took too long to iterate over terms.");
                } else if (Thread.interrupted()) {
                  throw new ExitingReaderException("Interrupted while iterating over terms.");
                }
              }
          

          Seems like this would expose another handy hook into the query life-cycle.

          Anshum Gupta added a comment -

          Added a Lucene test, fixed comments and renamed a few things.
          Looks fine to me other than the decision of what package/module does this belong to.

          I'm running the entire test suite now. The new tests pass, but I'm posting this before running the entire suite, as that would take another 40 minutes.

          Anshum Gupta added a comment -

          Added the check for Thread.interrupted(). I wasn't sure how it'd work, but I tested some random (non-Lucene/Solr) code and it seems like it makes sense to add it there.
          Thanks Rich Cariens.

          Anshum Gupta added a comment -

          Changed some java doc to make it easier for people to understand things.

          Perhaps a better way to review? https://reviews.apache.org/r/25658/

          P.S: Not sure how it should be done here, but I uploaded the patch to the review board and pasted the link here. Does anyone know of a way to integrate the two so that I wouldn't have to upload in both places and paste the link?

          Anshum Gupta added a comment - - edited

          Thanks Robert for the review. Updated the patch with most of your feedback + recommendations.

          I didn't change the design from using a ThreadLocal to passing the timeout in the constructor as the user might pre-construct and reuse the reader/searcher. (Solr specifically wouldn't play well with such a change).

          I did consider doing it that way but wasn't able to figure out how. I'll spend some more time to see if there's a clean way to do it (have the timeout be a parameter passed to the constructor instead of being a ThreadLocal).

          About the test relying on the system clock: not really that much, as the sleep is in the wrapped .next() and the timeout values are not even close.

          Also, I can't figure out how to view the review on the review board (it doesn't show up for me, and the mail went to spam until Steve mentioned the email). I've posted the updated patch there to make it easier for everyone to look at.

          Steve Rowe added a comment -

          Anshum Gupta, you apparently created two review requests, and Robert Muir reviewed the first one: https://reviews.apache.org/r/25656/ - you can see his review there.

          Anshum Gupta added a comment -

          Ah ok, that makes sense. It's just that I discarded the first request as soon as I created it.
          The newer patches etc. are on the newer request.

          https://reviews.apache.org/r/25658/

          Anshum Gupta added a comment - - edited

          https://reviews.apache.org/r/25658/

          Changed the design to have an abstract/base QueryTimeout class and implementations that use a local variable (for Lucene) and a ThreadLocal (for Solr).

          Changed the Lucene test to use the non-ThreadLocal implementation and SolrIndexSearcher uses the ThreadLocal implementation of QueryTimeout.

          Also, I had to increase timeAllowed on a few existing grouping-related tests.
          Until now, timeAllowed was only used while collecting the docs, so a q=*:* grouping query passed fine. With terms enumeration using the same timeAllowed param, requests time out more often than before this change (as expected).

          Anshum Gupta added a comment -

          Updated patch.

          Here are the changes:

          • Override the two cache methods getCoreCacheKey()/getCombinedCoreAndDeletesKey() to explicitly call super to support NRT.
          • The ExitingReaderException message now contains more information
          • QueryTimeout() and SolrQueryTimeout.set() accept timeAllowed (in ms) instead of timeoutAt value. This makes things easier.
          • QueryTimeoutBase is now an interface instead of being an abstract class. QueryTimeout and SolrQueryTimeout implement it.
          Steve Davids added a comment -

          Looks good to me; the only nit-picky thing I would say is that QueryTimeoutBase is a strange name for an interface. You may consider renaming it to "QueryTimeout" and renaming the current QueryTimeout class to something along the lines of LuceneQueryTimeout / DefaultQueryTimeout / SimpleQueryTimeout.

          Anshum Gupta added a comment -

          Thanks for that feedback Steve. I think I overlooked it over the iterations and changes. I spoke to Steve Rowe and he's about to post an updated patch that would also include that change.

          Steve Rowe added a comment - - edited

          Patch (see detailed info about my changes at a new review request I created: https://reviews.apache.org/r/25882/):

          • Renamed QueryTimeoutBase to QueryTimeout, QueryTimeout to QueryTimeoutImpl, and SolrQueryTimeout to SolrQueryTimeoutImpl
          • Changed timeout checks to use subtraction rather than direct comparison, to handle overflow (see the nanoTime() javadocs and the sketch after this comment)
          • Beefed up javadocs
          • Now parsing timeAllowed request param as a long (added this capability to SolrParams)
          • Added a cloud test
          • Added testing of very large timeAllowed values, and negative values too

          I beasted all 4 tests 20 times each; all passed.
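
          For illustration, the overflow-safe check that the subtraction bullet above refers to (a sketch only, not the patch code; see the System.nanoTime() javadocs):

              import java.util.concurrent.TimeUnit;

              // Illustrative only: System.nanoTime() values may wrap, so the javadocs say to
              // compare two readings only via their difference, never with a direct '>' comparison.
              class OverflowSafeTimeout {

                private final long timeoutAt; // nanoTime value at which the query should exit

                OverflowSafeTimeout(long timeAllowedMillis) {
                  this.timeoutAt = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeAllowedMillis);
                }

                boolean shouldExit() {
                  // Wrong:  System.nanoTime() > timeoutAt   (breaks if the value wraps)
                  // Right:  difference-based check, valid while the two readings are
                  //         less than ~292 years apart.
                  return System.nanoTime() - timeoutAt > 0;
                }
              }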

          Anshum Gupta added a comment -

          Thanks Steve. All these changes look great to me.
          I'd like to commit this sometime tomorrow if no one objects to this.

          Anshum Gupta added a comment -

          https://reviews.apache.org/r/25658/

          The changes integrate Steve's changes and the AtomicReader -> LeafReader refactor.

          I've run precommit and the tests on trunk, so I'm just about to commit this now.

          ASF subversion and git services added a comment -

          Commit 1627622 from Anshum Gupta in branch 'dev/trunk'
          [ https://svn.apache.org/r1627622 ]

          SOLR-5986: Add an ExitableDirectoryReader in Lucene and use that to support exiting of long running queries in Solr.

          ASF subversion and git services added a comment -

          Commit 1627635 from Anshum Gupta in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1627635 ]

          SOLR-5986: Add an ExitableDirectoryReader in Lucene and use that to support exiting of long running queries in Solr. (Merge from trunk r1627622)

          Shalin Shekhar Mangar added a comment -

          Hi Anshum Gupta, the CloudExitableDirectoryReaderTest has been failing very frequently on jenkins, can you please take a look?

          Anshum Gupta added a comment -

          Shalin Shekhar Mangar, I had planned to wrap this up today.
          Just trying to minimize the variables involved in the test while I change it.

          ASF subversion and git services added a comment -

          Commit 1629329 from hossman@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1629329 ]

          SOLR-5986: comment out failing assertion in TestDistributedSearch until anshum can review/fix

          ASF subversion and git services added a comment -

          Commit 1629330 from hossman@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1629330 ]

          SOLR-5986: comment out failing assertion in TestDistributedSearch until anshum can review/fix (merge r1629329)

          Hoss Man added a comment -

          FYI: this assertion (modified by r1627622/r1627635) has been failing in jenkins several times since committed...

          1498993     shalin       // test group query
          1627635     anshum       // TODO: Remove this? This doesn't make any real sense now that timeAllowed might trigger early
          1627635     anshum       //       termination of the request during Terms enumeration/Query expansion.
          1627635     anshum       //       During such an exit, partial results isn't supported as it wouldn't make any sense.
          1627635     anshum       // Increasing the timeAllowed from 1 to 100 for now.
          1498993     shalin       queryPartialResults(upShards, upClients,
          1498993     shalin           "q", "*:*",
          1498993     shalin           "rows", 100,
          1498993     shalin           "fl", "id," + i1,
          1498993     shalin           "group", "true",
          1498993     shalin           "group.query", t1 + ":kings OR " + t1 + ":eggs",
          1498993     shalin           "group.limit", 10,
          1498993     shalin           "sort", i1 + " asc, id asc",
          1627635     anshum           CommonParams.TIME_ALLOWED, 100,
          1498993     shalin           ShardParams.SHARDS_INFO, "true",
          1498993     shalin           ShardParams.SHARDS_TOLERANT, "true");
          

          example: http://jenkins.thetaphi.de/job/Lucene-Solr-5.x-Linux/11221/

          Error Message:
          Request took too long during query expansion. Terminating request.
          
          Stack Trace:
          org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Request took too long during query expansion. Terminating request.
                  at __randomizedtesting.SeedInfo.seed([377AFD4F005F159A:B69C7357770075A6]:0)
                  at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:570)
                  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:215)
                  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
                  at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
                  at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
                  at org.apache.solr.TestDistributedSearch.queryPartialResults(TestDistributedSearch.java:596)
                  at org.apache.solr.TestDistributedSearch.doTest(TestDistributedSearch.java:499)
                  at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:875)
          

          I'm not fully understanding what anshum meant by this TODO, and I think he's offline for the next few days, so I went ahead and commented this out with a link back to this JIRA for him to look at before resolving it.

          Anshum Gupta added a comment -

          Here's what I meant by that statement: this test should either be removed or modified to ensure that timeAllowed is never hit during query expansion.

          Here's more of the context:
          Until 5986 was committed, the timeAllowed parameter was only used during the collection stage. That stage also supported returning of partial matches if some shards returned responses and didn't time out.

          After this commit, the timeAllowed parameter can lead to early termination of a request well before the search actually happens, i.e. during query expansion. At that stage, partial results aren't returned.

          The current test sends a request assuming that the timeout will only happen during the collection stage, leading to partial results being returned. But if the request times out during query expansion, no partial results are returned, leading to a test failure.

          I'll remove the partial results test. I'll also think about adding something to replace it (I certainly don't want coverage to go down, but this test isn't really a valid case anymore). Maybe add something that uses caching to avoid query expansion but times out during doc collection.

          Steve Davids added a comment -

          Why wouldn't it return partial results? When sending a distributed request, if all but one shard return results and that one shard lags behind at query expansion, one would think you would get the appropriate partial results message. Unless this is partially related to SOLR-6496, which would retry a different replica in the shard group and thus could cause a timeout at the Solr distributed aggregation layer.

          Anshum Gupta added a comment -

          The partial results option (from what I remember) is only set when a TimeExceededException is caught. For now, the intention behind this was only to keep the cluster healthy. I do have a few things in mind as far as improving/iterating on this is concerned, e.g. respecting a global timer (per request), standardizing the behavior, etc.

          Shalin Shekhar Mangar added a comment -

          The partial results option (from what I remember) is only set when a TimeExceededException is caught. For now, the intention behind this was to only keep the cluster healthy

          Does that mean that if I specify shards.tolerant=true and timeAllowed parameter and if the query on a shard is terminated because of this feature, then partialResults flag is not set on responses?

          Anshum Gupta added a comment -

          Shalin Shekhar Mangar Yes, as of now, an exception is expected in case a query fails due to timing out while rewriting the query.

          Anshum Gupta added a comment -

          I'm considering returning partial results to make sure that we don't have to modify a lot of our tests (especially for grouping + timeAllowed). I'll put up a patch for that.

          Tomás Fernández Löbbe added a comment -

          My understanding is that there are two cases where "partialResults=true" is returned: if timeAllowed is specified and a TimeExceededException is caught, as you said, but also when a shard request fails (for any exception) and "shards.tolerant=true" is set.
          I guess in a distributed request right now the second case will occur?

          Hide
          Anshum Gupta added a comment -

          Correct, Tomás Fernández Löbbe.

          Right now, when a TimeExceededException is caught in SolrIndexSearcher.buildAndRunCollectorChain, it explicitly sets qr.setPartialResults(true). Nothing of this sort happens in the case of an ExitingReaderException, leading to different behavior depending on whether a query takes too long during collection vs. during rewriting (even though the same parameter is used in both cases). I'll just change this so that ExitingReaderException is also caught and handled.
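          Not the actual patch, but a minimal sketch of the handling described above: both timeout exceptions are unchecked, so they can be caught around the same block and treated identically. 'runCollectorChain' stands in for whatever work can exceed timeAllowed.

            import org.apache.lucene.index.ExitableDirectoryReader;
            import org.apache.lucene.search.TimeLimitingCollector;
            import org.apache.solr.search.SolrIndexSearcher;

            public class PartialResultsHandlingSketch {
              static void runWithTimeAllowed(SolrIndexSearcher.QueryResult qr, Runnable runCollectorChain) {
                try {
                  runCollectorChain.run();
                } catch (TimeLimitingCollector.TimeExceededException e) {
                  // collection-stage timeout: mark the response as partial, as today
                  qr.setPartialResults(true);
                } catch (ExitableDirectoryReader.ExitingReaderException e) {
                  // rewrite-stage timeout: also mark the response as partial instead of
                  // letting the exception turn into a request failure
                  qr.setPartialResults(true);
                }
              }
            }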

          Hide
          Anshum Gupta added a comment -

          An attempt at fixing the tests. With this patch, the ExitingReaderException also sets partialResults to true in the response, like the TimeExceededException does.

          This should make the behavior more consistent than it is now, though the exception in the ExitingReaderException case is still re-thrown.

          Hide
          Anshum Gupta added a comment -

          Patch without the debug prints. Also, I'm trying to run the tests locally with these changes for a few hours. If I don't see any issues, I'll commit this patch.

          Hide
          Anshum Gupta added a comment -

          A better patch for fixing the failures (though it's more invasive).
          This changes the timeout response: we no longer return an exception but instead follow the lines of TimeLimitingCollector.

          The presence of 'partialResults' in the 'responseHeader' implies that timeAllowed kicked in and aborted processing of one or more shards (the way it happens right now). We still log the exception, though. I also changed things in the grouping flow, though I'll perhaps spend a couple more hours on that.

          Hide
          Anshum Gupta added a comment -

          Digging deeper into all of this, it seems like there's a bigger problem at hand. As of now, when timeAllowed is set, we never get back an exception; partialResults in the response header is just set to true in case of a shard failure. This translates to shards.tolerant being ignored in that case.
          At the code level, the TimeExceededException never reaches the ShardHandler, so the exception is never set (similarly for ExitingReaderException) and/or returned to the client.

          I'll create another issue to handle all of that and take it up once this issue is resolved.

          For this issue, I think we should just target sticking to the current behavior when timeAllowed is passed, i.e. set partialResults in the header to true when either of the time-limiting exceptions is hit. The last patch uploaded here should fix that.

          Hide
          Steve Rowe added a comment -

          For this issue, I think we should just target sticking to the current behavior when timeAllowed is passed, i.e. set partialResults in the header to true when either of the time-limiting exceptions is hit. The last patch uploaded here should fix that.

          +1 to commit the last patch.

          Hide
          ASF subversion and git services added a comment -

          Commit 1630583 from Anshum Gupta in branch 'dev/trunk'
          [ https://svn.apache.org/r1630583 ]

          SOLR-5986: Fix tests and start returning partial results in case of ExitingReaderException

          Hide
          ASF subversion and git services added a comment -

          Commit 1630611 from Anshum Gupta in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1630611 ]

          SOLR-5986: Fix tests and start returning partial results in case of ExitingReaderException (Merge from trunk r1630583)

          Hide
          ASF subversion and git services added a comment -

          Commit 1631145 from Anshum Gupta in branch 'dev/trunk'
          [ https://svn.apache.org/r1631145 ]

          SOLR-5986: Removing a todo that was introduced with a prior commit. It doesn't make any sense now as partialResults are now returned in all cases where timeAllowed kicks in

          Hide
          ASF subversion and git services added a comment -

          Commit 1631146 from Anshum Gupta in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1631146 ]

          SOLR-5986: Removing a todo that was introduced with a prior commit. It doesn't make any sense now as partialResults are now returned in all cases where timeAllowed kicks in (merge from trunk)

          Hide
          Anshum Gupta added a comment - - edited

          I'm marking this issue as resolved now, as all tests pass and no todos are left unresolved.
          I've opened another issue (SOLR-6616) that is related but open for discussion.

          Hide
          Anshum Gupta added a comment -

          Bulk close after 5.0 release.


            People

            • Assignee:
              Anshum Gupta
              Reporter:
              Steve Davids
            • Votes:
              3
              Watchers:
              25
