SOLR-3284
StreamingUpdateSolrServer swallows exceptions

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5, 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: clients - java
    • Labels:
      None

      Description

      StreamingUpdateSolrServer eats exceptions thrown by lower-level code, such as HttpClient, when doing adds. It may happen with other methods too, though I know that query and deleteByQuery will throw exceptions. I believe that this is a result of the queue/Runner design. That's what makes SUSS perform better, but it means you sacrifice the ability to programmatically determine that there was a problem with your update. All errors are logged via slf4j, but that's not terribly helpful except for determining what went wrong after the fact.

      When using CommonsHttpSolrServer, I've been able to rely on getting an exception thrown by pretty much any error, letting me use try/catch to detect problems.

      There's probably enough dependent code out there that it would not be a good idea to change the design of SUSS, unless there were alternate constructors or additional methods available to configure new/old behavior. Fixing this is probably not trivial, so it's probably a better idea to come up with a new server object based on CHSS. This is outside my current skillset.

      1. SOLR-3284.patch
        3 kB
        Shawn Heisey

        Activity

        Sami Siren added a comment -

        What are the operations/error situations where you are not seeing an Exception when you expect one?

        By default the ConcurrentUpdateSolrServer (StreamingUpdateSolrServer) just logs the exceptions from updates but you can override this functionality:

            SolrServer server = new ConcurrentUpdateSolrServer("http://127.0.0.1:8983/solr", 1, 1) {
              @Override
              public void handleError(Throwable ex) {
                // do something with the Throwable here
                System.out.println("Something wrong! " + ex.getMessage());
              }
            };

            server.add(new SolrInputDocument());

        The current exception reporting is pretty limited and it is impossible to see which operation triggered the exception but such improvements should be done in separate issues.

        Shawn Heisey added a comment -

        If the Solr server goes down in between updates done with the concurrent server, doing further updates will fail, but the calling code will not know that. With the Commons or Http server, an exception is thrown that my code catches.

        I don't think that just overriding handleError is enough. If Solr goes down but the machine is still up, you have immediate failure detection because the connection will be refused. If the server goes away entirely, it could take a couple of minutes to fail. You would have to provide methods to check that 1) all background operations are complete and 2) they were error free.

        I can no longer remember whether an exception is thrown when trying a commit against a down machine with the concurrent server. IIRC it does throw one in this instance. I definitely believe that it should. Perhaps the current handleError code could update class-level members (with names like "boolean updateErrored" and "SolrServerException updateException") that could be checked and used by the commit method. If they are set, it would reset them and throw an exception (fast-fail) without actually trying the commit. There should probably be a constructor option and a set method to either activate this new behavior or restore the original behavior.
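
        A minimal sketch of the delayed-exception idea described above, assuming hypothetical class and method names (this is not actual SolrJ API): the first background error is remembered by handleError and rethrown on the next commit, which then resets the flag.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a ConcurrentUpdateSolrServer subclass.
// The first error reported by a background Runner is stored; the next
// request on the caller's thread fails fast by rethrowing it.
class DelayedErrorServer {
    private final AtomicReference<Exception> firstError = new AtomicReference<>();

    // Would be called from the background Runner threads on failure.
    void handleError(Exception ex) {
        firstError.compareAndSet(null, ex); // keep only the first error
    }

    // Any subsequent request checks and resets the stored error first.
    void commit() throws Exception {
        Exception ex = firstError.getAndSet(null);
        if (ex != null) {
            throw new Exception("previous background update failed", ex);
        }
        // ... perform the actual commit here ...
    }
}
```

        After the stored error is rethrown once, the flag is clear again, matching the "reset them and throw" behavior described in the comment.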

        When I first designed my code, I was relying on the exceptions thrown by the commons server when doing the actual update, so it's too late by the time it reaches the commit - it has already updated the position values. I now realize that this is incorrect design, though I might never have figured it out without my attempt to use the concurrent server. It's going to be a bit painful to redesign my code to put off updating position values until after a successful commit operation. It's something I do intend to do.

        Shawn Heisey added a comment - - edited

        First crack at a patch for throwing delayed exceptions. It should do this on any request when a previous request resulted in an error, not just on commits. I did not attempt to make any unit tests. I'm not entirely sure how unit tests work even when things are supposed to succeed; how to simulate a failure is even less obvious.

        Shawn Heisey added a comment -

        The patch should also apply successfully to 3.6.

        Shawn Heisey added a comment -

        After looking at existing tests to see how I might implement tests for this new functionality, I couldn't see how to do it. Also, I noticed that there are tests for SolrCloud and something else called ChaosMonkey. All tests in solr/ pass with this patch, but I don't know how SolrCloud might be affected. I would hope that it already handles exceptions properly and therefore wouldn't have any problems, but I have never looked at the code or used SolrCloud.

        Lance Norskog added a comment -

        In the trunk it is now ConcurrentUpdateSolrServer, and it swallows errors all over the place.

        I am a total fascist about fail-fast/fail-loud. A more obnoxious (fail-fast) design is:

        • Any exception in a worker causes the main driver to kill all other workers.
        • The default handler after this is a rollback.
        • Add a flush() method to SolrServer.
          • CUSS uses this to report back to the main thread that everything blew up.
          • Other server classes would do a socket flush and wait and drain out any error text.
        • Use a finalize() method to complain that the user did not call flush().
          • This is in all SolrServer classes - the same application code should do the same thing with every SolrServer implementation.
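
        The fail-fast core of this design could be sketched like this; FailFastPool and its methods are illustrative stand-ins under my own naming, not SolrJ classes. The first worker failure shuts the pool down, and flush() reports it back on the caller's thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: any exception in a worker kills the other workers, and a
// mandatory flush() rethrows the failure to the main thread.
class FailFastPool {
    private final ExecutorService workers = Executors.newFixedThreadPool(2);
    private final AtomicReference<Exception> failure = new AtomicReference<>();

    void submit(Runnable task) {
        workers.submit(() -> {
            try {
                task.run();
            } catch (Exception ex) {
                if (failure.compareAndSet(null, ex)) {
                    workers.shutdownNow(); // kill all other workers
                }
            }
        });
    }

    // Caller must invoke flush(); it drains the pool and reports failure.
    void flush() throws Exception {
        workers.shutdown();
        workers.awaitTermination(10, TimeUnit.SECONDS);
        Exception ex = failure.get();
        if (ex != null) throw new Exception("worker failed", ex);
    }
}
```

        A finalize()-based nag for a missing flush() call, as suggested above, would sit on top of this; it is omitted here for brevity.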

        We tell our customers not to use SUSS/CUSS because it does not handle errors right. This causes problems partly because people tend to "design to success" (assume everything works) instead of "design to failure" (assume everything breaks). When CUSS breaks, the production scripting and staff do not notice.

        Mike Sokolov added a comment - - edited

        Another approach we have been using in a similar parallel writing situation is to have the thread pool maintain a fixed-size list of exceptions. Whenever a worker throws one, it gets put on the list. Then at certain barrier conditions (flush, exception list full), the exceptions are collected together and re-thrown using an umbrella exception that wraps them. At the same time, all worker threads are terminated. Do you need flush(), or can you rely on the user calling commit() (with flush() called internally)?

        This enables writing to continue even in the face of errors, but the errors do get reported eventually. This makes the system more robust in the face of transient failure conditions.
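
        The bounded-list approach described here might look roughly as follows. ErrorCollector is a hypothetical name of my own; the umbrella exception uses addSuppressed to carry the individual failures, which is one possible way to wrap them.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: workers record exceptions into a bounded list; at a barrier
// (flush, or the list filling up) they are rethrown together under one
// umbrella exception, and the list is cleared.
class ErrorCollector {
    private final int limit;
    private final List<Exception> errors = new ArrayList<>();

    ErrorCollector(int limit) { this.limit = limit; }

    // Called by workers; returns true when the barrier condition is hit.
    synchronized boolean record(Exception ex) {
        errors.add(ex);
        return errors.size() >= limit;
    }

    // Called at flush/commit: wraps and clears any accumulated errors.
    synchronized void rethrowIfAny() throws Exception {
        if (errors.isEmpty()) return;
        Exception umbrella = new Exception(errors.size() + " update(s) failed");
        for (Exception e : errors) umbrella.addSuppressed(e);
        errors.clear();
        throw umbrella;
    }
}
```

        Because the list is cleared when the umbrella is thrown, indexing can continue after transient failures while still reporting every error that occurred in the interval.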

        Shawn Heisey added a comment -

        Lance, if your idea (or your idea in combination with Mike's) can be implemented without a ton of work, it should definitely replace my patch. If it's a royal pain, my patch is there as an interim solution and yours can be handled in another issue.

        Is the following a good summary of your idea? If so, I think your idea is better. I think it should be the default behavior for 4x and trunk.

        A user of ConcurrentUpdateSolrServer would make update requests, then call a publicly exposed flush(), either explicitly or implicitly by calling commit(). If the flush() fails due to a background failure, all requests since the last successful flush() would be rolled back. If the user code is designed with this in mind, error handling is as good as with HttpSolrServer. HSS (and probably ESS) would use flush() internally. The user would be free to call flush() themselves, but it would not be required.

        I'm curious about how things would be handled with autoCommit or commitWithin. Is rollback possible when those are used?

        Alan Woodward added a comment -

        This interacts really badly with bad documents (SOLR-445) - once one of the Runners passes a bad document to Solr, all documents subsequently processed by that Runner will be silently discarded, and there doesn't seem to be any way to detect this in client code.

        Mikhail Khludnev added a comment - - edited

        Hi,

        Excuse me for hijacking, but the server-side analog of CUSS (SOLR-3585) is really strict in case of any errors: it gives up and throws them back to clients.

        Mark Miller added a comment -

        I'd love to see a way to get back specific error information about failures. That seems preferable to just stopping or throwing an exception in a lot of cases. It would be nice to treat different cases separately though - e.g. bad doc vs. down server.

        Shawn Heisey added a comment -

        I had an idea for tracking the first update that caused an error. This is a little convoluted, so bear with me a little, and let me know what you think.

        This first part is something that I think would be a good idea to add even if the rest is useless: The server object will have a field called requestNumber, an AtomicInteger, that will be incremented every time an UpdateRequest is used. When an UpdateRequest is made and it does not already contain a parameter called concurrentUpdateId, the string representation of the current requestNumber will be inserted into the request as the value of that parameter. Originally I had planned to have it always use the incrementing integer values, but a lot of people may already have an existing ID value, and using that value would be MUCH easier. The concurrentUpdateId parameter that gets used will be included in the dummy response, and it can also be used when Runner logs status messages from each update.
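
        The tagging half of this idea could be sketched as follows. RequestTagger and the plain Map of parameters are illustrative stand-ins for an UpdateRequest and its params, not real SolrJ types; only requestNumber and concurrentUpdateId come from the proposal above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: each request gets a unique concurrentUpdateId unless the
// caller supplied one, so background failures can later be reported
// against a specific update.
class RequestTagger {
    private final AtomicInteger requestNumber = new AtomicInteger();

    // A Map stands in for the request's parameter set.
    Map<String, String> tag(Map<String, String> params) {
        if (!params.containsKey("concurrentUpdateId")) {
            params.put("concurrentUpdateId",
                       Integer.toString(requestNumber.incrementAndGet()));
        }
        return params;
    }
}
```

        A caller-supplied ID (for example, the document's own unique key) passes through untouched, matching the point that reusing an existing ID is much easier than forcing the generated counter.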

        Now for the rest: One of the methods on the server object will set the error handling method that will get used. One choice will be exactly the way it works now, including the ability to override the handleError method. Another choice will be to raise a flag that can be checked manually using various get methods. The remaining choice would be very similar to the patch already attached to this issue, except that the update ID of the update that caused the first internal exception will be in the message on the encapsulating exception.

        I've got some of this written already. It is pushing my java skill boundaries a little.

        Even if the latter part is not implemented, if the first part goes in, the user could override a new handleError(Exception, String) method to track the ID of every update that fails, not just the first.

        Shawn Heisey added a comment -

        I'll try to separate the patches for each half of this idea.

        Shawn Heisey added a comment -

        I think that having optional error reporting built into the concurrent class is important, even if it cannot tell you which of your recent updates failed. Does anyone think my patch is a good first step, should I try another approach, or is this whole idea doomed?

        I am not planning to close this issue at this time, but this comment is part of an effort to close old issues that I have reported. Search tag: elyograg2013springclean

        Shawn Heisey added a comment -

        I have a proposed patch that is very likely to need updating because it is so old.

        There is an issue for CloudSolrServer, the one to route documents to the correct shard, that has a concurrent mode that apparently still will throw exceptions. Can that be adapted for use here?

        Stewart Sims added a comment -

        This has been a source of errors that were difficult for us to track down when using the ConcurrentUpdateSolrServer to index a large volume of data (including some very large individual documents). We were confused for a long time as to why some documents were not being indexed, which turned out to be a combination of data errors and bad-request errors due to too many concurrent requests. Once we switched to HttpSolrServer the problems with bad requests went away, and we were able to get more informative exceptions which helped to find the data problems.

        Had my colleague not found the thread below, we would have struggled to figure out the exact causes of our problems:
        http://lucene.472066.n3.nabble.com/Missing-documents-with-ConcurrentUpdateSolrServer-vs-HttpSolrServer-td4033637.html

        One alternative to a code change might be to advise caution with using the ConcurrentUpdateSolrServer with large volumes of data. HttpSolrServer seems for us (using Solr 4.2.1) to be more stable and only takes about 30% more time to index.


          People

          • Assignee:
            Shawn Heisey
            Reporter:
            Shawn Heisey
          • Votes:
            4
            Watchers:
            12
