Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-7441

Improve overall robustness of the Streaming stack: Streaming API, Streaming Expressions, Parallel SQL

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 5.1
    • Fix Version/s: 6.0
    • Component/s: None
    • Labels:
      None

      Description

      It's harder than it could be to figure out what the error is when using Streaming Aggregation. For instance if you specify an fl parameter for a field that doesn't exist it's hard to figure out that's the cause. This is true even if you look in the Solr logs.

      I'm not quite sure whether it'd be possible to report this at the client level or not, but it seems at least we could repor something more helpful in the Solr logs.

      1. SOLR-7441.patch
        37 kB
        Joel Bernstein
      2. SOLR-7441.patch
        16 kB
        Joel Bernstein
      3. SOLR-7441.patch
        13 kB
        Gus Heck
      4. SOLR-7441.patch
        13 kB
        Gus Heck

        Issue Links

          Activity

          Hide
          joel.bernstein Joel Bernstein added a comment - - edited

          Let's get this fixed for 5.2. I remember running across an error that didn't make it to the logs. Later when I went back to see if I could find an error that was not appearing in the logs I wasn't able to find one.

          Is the fl error not ending up in the logs, or just being logged in a complicated way?

          In parallel mode you also end up with a json error on the worker node, as well as the original error on the search node.Would be good propagate the original error from the search node to the worker node to the client so you don't have dig around in the logs.

          Show
          joel.bernstein Joel Bernstein added a comment - - edited Let's get this fixed for 5.2. I remember running across an error that didn't make it to the logs. Later when I went back to see if I could find an error that was not appearing in the logs I wasn't able to find one. Is the fl error not ending up in the logs, or just being logged in a complicated way? In parallel mode you also end up with a json error on the worker node, as well as the original error on the search node.Would be good propagate the original error from the search node to the worker node to the client so you don't have dig around in the logs.
          Hide
          erickerickson Erick Erickson added a comment -

          Nothing came out in the server log, and the client log was unhelpful (not quite sure what it was now...). This was just using a CloudSolrStream....

          I'm not at all sure when opening a stream and expecting back a TupleStream but get an error message how it would be decoded into a log message though, so just dumping it into the server log would certainly be a step forward.

          Show
          erickerickson Erick Erickson added a comment - Nothing came out in the server log, and the client log was unhelpful (not quite sure what it was now...). This was just using a CloudSolrStream.... I'm not at all sure when opening a stream and expecting back a TupleStream but get an error message how it would be decoded into a log message though, so just dumping it into the server log would certainly be a step forward.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Ok, I'll take a look and figure out why this error is not making into the logs.

          Show
          joel.bernstein Joel Bernstein added a comment - Ok, I'll take a look and figure out why this error is not making into the logs.
          Hide
          gus_heck Gus Heck added a comment -

          Exceptions on open/start/finish/close are hard, but I think I have a solution for passing back exceptions encountered while reading (with test) if you want it?

          Show
          gus_heck Gus Heck added a comment - Exceptions on open/start/finish/close are hard, but I think I have a solution for passing back exceptions encountered while reading (with test) if you want it?
          Hide
          erickerickson Erick Erickson added a comment -

          Gus:

          It's always appropriate attach a patch!

          We usually label them SOLR-7441.patch in this case, don't worry multiple versions with the same name are preserved so that makes it easy to know what the most recent was.

          Thanks!

          Show
          erickerickson Erick Erickson added a comment - Gus: It's always appropriate attach a patch! We usually label them SOLR-7441 .patch in this case, don't worry multiple versions with the same name are preserved so that makes it easy to know what the most recent was. Thanks!
          Hide
          gus_heck Gus Heck added a comment -

          Patch vs 5.x branch for passing exceptions that happen during reading of the stream on the server back to the client.

          Show
          gus_heck Gus Heck added a comment - Patch vs 5.x branch for passing exceptions that happen during reading of the stream on the server back to the client.
          Hide
          gus_heck Gus Heck added a comment -

          oops, let me send another, just realized that a file got missed (also want to take out spurious whitespace changes that my IDE did)

          Show
          gus_heck Gus Heck added a comment - oops, let me send another, just realized that a file got missed (also want to take out spurious whitespace changes that my IDE did)
          Hide
          gus_heck Gus Heck added a comment -

          Improved & complete patch

          Show
          gus_heck Gus Heck added a comment - Improved & complete patch
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Let's break this into two tickets. One for the specific logging issue with the fl and another ticket for error propagation.

          Show
          joel.bernstein Joel Bernstein added a comment - Let's break this into two tickets. One for the specific logging issue with the fl and another ticket for error propagation.
          Hide
          joel.bernstein Joel Bernstein added a comment - - edited

          SInce this already has a patch attached, I'll create a ticket for the fl logging issue

          Show
          joel.bernstein Joel Bernstein added a comment - - edited SInce this already has a patch attached, I'll create a ticket for the fl logging issue
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Erick Erickson, just ran a test case with an incorrect fl and got the error below. So I think there is another factor involved when the error doesn't appear in the logs. I'll see if I can reproduce the issue. Let me know if you can reproduce it.

          Throwable #1: org.apache.solr.common.SolrException: undefined field: "blah1"
          [junit4] > at __randomizedtesting.SeedInfo.seed([1A2758A869E78AB2:AED7CFC964A88012]:0)
          [junit4] > at org.apache.solr.schema.IndexSchema.getField(IndexSchema.java:1223)
          [junit4] > at org.apache.solr.response.SortingResponseWriter.getFieldWriters(SortingResponseWriter.java:232)
          [junit4] > at org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:120)
          [junit4] > at org.apache.solr.util.TestHarness.query(TestHarness.java:326)
          [junit4] > at org.apache.solr.util.TestHarness.query(TestHarness.java:302)
          [junit4] > at org.apache.solr.response.TestSortingResponseWriter.testSortingOutput(TestSortingResponseWriter.java:145)
          [junit4] > at java.lang.Thread.run(Thread.java:745)

          Show
          joel.bernstein Joel Bernstein added a comment - Erick Erickson , just ran a test case with an incorrect fl and got the error below. So I think there is another factor involved when the error doesn't appear in the logs. I'll see if I can reproduce the issue. Let me know if you can reproduce it. Throwable #1: org.apache.solr.common.SolrException: undefined field: "blah1" [junit4] > at __randomizedtesting.SeedInfo.seed( [1A2758A869E78AB2:AED7CFC964A88012] :0) [junit4] > at org.apache.solr.schema.IndexSchema.getField(IndexSchema.java:1223) [junit4] > at org.apache.solr.response.SortingResponseWriter.getFieldWriters(SortingResponseWriter.java:232) [junit4] > at org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:120) [junit4] > at org.apache.solr.util.TestHarness.query(TestHarness.java:326) [junit4] > at org.apache.solr.util.TestHarness.query(TestHarness.java:302) [junit4] > at org.apache.solr.response.TestSortingResponseWriter.testSortingOutput(TestSortingResponseWriter.java:145) [junit4] > at java.lang.Thread.run(Thread.java:745)
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Also I'm assuming you're using the "/export" handler.

          Show
          joel.bernstein Joel Bernstein added a comment - Also I'm assuming you're using the "/export" handler.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Ok, found the problem. The exception is being thrown but not logged. So it appears in the test and the browser but not in the log. I'll figure out why.

          Show
          joel.bernstein Joel Bernstein added a comment - Ok, found the problem. The exception is being thrown but not logged. So it appears in the test and the browser but not in the log. I'll figure out why.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          SOLR-7472 has a patch which resolves the fl logging issue. This fix will be in 5.2.

          We can continue to work with Gus's patch in this ticket.

          Show
          joel.bernstein Joel Bernstein added a comment - SOLR-7472 has a patch which resolves the fl logging issue. This fix will be in 5.2. We can continue to work with Gus's patch in this ticket.
          Hide
          gus_heck Gus Heck added a comment -

          I'm sure there are plenty of more important patches out there, but I'd be interested to know if anyone has feedback on my patch.

          To summarize the patch, I catch any exception, serialize it, toss it in the maps with a special key (there seem to be a couple of those already). Then I check for the presence of an exception on the other end, and deserialize/throw rather than adding to the TreeSet when I find an exception. This is nice since if you are running the code from your IDE while developing, it shows up in the local output, and the stack trace becomes clickable (if your ide does that). In normal production use, it will allow the exception to show up in the logs for the client that made the request rather than showing up in the log of one of the N nodes in the cluster. One could event catch and handle different exceptions from the worker nodes differently. It does however leave the client with an incomplete set of tuples, so perhaps those should be cleared when an exception is found, or perhaps there should be best-effort & fail-fast modes, with best-effort collecting a list of exceptions instead?

          An additional enhancement might be to find a way to identify the node on which this exception occurred and wrap the original exception in an exception with the node identity information added to the message before serializing it, but I thought of that after I submitted the patch and haven't actually gone back to attempt it yet.

          Show
          gus_heck Gus Heck added a comment - I'm sure there are plenty of more important patches out there, but I'd be interested to know if anyone has feedback on my patch. To summarize the patch, I catch any exception, serialize it, toss it in the maps with a special key (there seem to be a couple of those already). Then I check for the presence of an exception on the other end, and deserialize/throw rather than adding to the TreeSet when I find an exception. This is nice since if you are running the code from your IDE while developing, it shows up in the local output, and the stack trace becomes clickable (if your ide does that). In normal production use, it will allow the exception to show up in the logs for the client that made the request rather than showing up in the log of one of the N nodes in the cluster. One could event catch and handle different exceptions from the worker nodes differently. It does however leave the client with an incomplete set of tuples, so perhaps those should be cleared when an exception is found, or perhaps there should be best-effort & fail-fast modes, with best-effort collecting a list of exceptions instead? An additional enhancement might be to find a way to identify the node on which this exception occurred and wrap the original exception in an exception with the node identity information added to the message before serializing it, but I thought of that after I submitted the patch and haven't actually gone back to attempt it yet.
          Hide
          gus_heck Gus Heck added a comment -

          any thoughts on this?

          Show
          gus_heck Gus Heck added a comment - any thoughts on this?
          Hide
          gus_heck Gus Heck added a comment -

          Will this make 5.2? Is there anything you need from me?

          Show
          gus_heck Gus Heck added a comment - Will this make 5.2? Is there anything you need from me?
          Hide
          joel.bernstein Joel Bernstein added a comment -

          I haven't yet had a chance to review the ticket, but I will try to get to it soon. I have a pretty big backlog of work on the Streaming API, Streaming Expressions and Parallel SQL stack, but error handling is an important part of the stack.

          Show
          joel.bernstein Joel Bernstein added a comment - I haven't yet had a chance to review the ticket, but I will try to get to it soon. I have a pretty big backlog of work on the Streaming API, Streaming Expressions and Parallel SQL stack, but error handling is an important part of the stack.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Gus Heck, I've had a chance to review the patch. I think there are some things in there that we can use. The idea of having a field in the Tuple to hold the error makes it easy to propagate errors through the tiers. The trick will be getting the error added to an outgoing tuple in all the different places in the stack where an error can originate. Below is a preliminary list of the areas where exceptions may originate in the stack. Some of the areas are handled by the patch you provided.

          1) Error from the /select handler before the stream is started (JSONResponseWriter)
          2) Error from the /export handler before the stream is started (SortingResponseWriter)
          3) Error midstream from the /select handler;
          4) Error midstream from the /export handler.
          5) Error from the Streaming API, before the request is sent. (Non-parallel mode)
          6) Error from the Streaming API, before the request is sent. (In Parallel mode or on the worker)
          7) Error from the SQL Parser (SQLHandler)

          Show
          joel.bernstein Joel Bernstein added a comment - Gus Heck , I've had a chance to review the patch. I think there are some things in there that we can use. The idea of having a field in the Tuple to hold the error makes it easy to propagate errors through the tiers. The trick will be getting the error added to an outgoing tuple in all the different places in the stack where an error can originate. Below is a preliminary list of the areas where exceptions may originate in the stack. Some of the areas are handled by the patch you provided. 1) Error from the /select handler before the stream is started (JSONResponseWriter) 2) Error from the /export handler before the stream is started (SortingResponseWriter) 3) Error midstream from the /select handler; 4) Error midstream from the /export handler. 5) Error from the Streaming API, before the request is sent. ( Non-parallel mode ) 6) Error from the Streaming API, before the request is sent. ( In Parallel mode or on the worker ) 7) Error from the SQL Parser (SQLHandler)
          Hide
          joel.bernstein Joel Bernstein added a comment - - edited

          Added a patch that adds an Exception handling framework to the Streaming API.

          The new ExceptionStream wraps a TupleStream and catches any exceptions from the underlying Stream. It then logs the exception and returns a Tuple with the exception message included.

          The SolrStream has some code added to look for Tuples with an
          Exception message and then to throw an Exception passing along the exception message from the Tuple.

          This effectively allows exceptions to be passed through any number of distributed tiers.

          The nice thing about this design is that the SolrStream throws and the ExceptionStream catches. All other Streams in between can ignore exception handling entirely.

          No tests yet to prove this concept actually works, but it looks promising.

          Show
          joel.bernstein Joel Bernstein added a comment - - edited Added a patch that adds an Exception handling framework to the Streaming API. The new ExceptionStream wraps a TupleStream and catches any exceptions from the underlying Stream. It then logs the exception and returns a Tuple with the exception message included. The SolrStream has some code added to look for Tuples with an Exception message and then to throw an Exception passing along the exception message from the Tuple. This effectively allows exceptions to be passed through any number of distributed tiers. The nice thing about this design is that the SolrStream throws and the ExceptionStream catches. All other Streams in between can ignore exception handling entirely. No tests yet to prove this concept actually works, but it looks promising.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Patch with all tests passing

          Show
          joel.bernstein Joel Bernstein added a comment - Patch with all tests passing
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1689045 from Joel Bernstein in branch 'dev/trunk'
          [ https://svn.apache.org/r1689045 ]

          SOLR-7441, SOLR-7647: Improve overall robustness of the Streaming stack: Streaming API, Streaming Expressions, Parallel SQL

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1689045 from Joel Bernstein in branch 'dev/trunk' [ https://svn.apache.org/r1689045 ] SOLR-7441 , SOLR-7647 : Improve overall robustness of the Streaming stack: Streaming API, Streaming Expressions, Parallel SQL
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1689559 from Joel Bernstein in branch 'dev/trunk'
          [ https://svn.apache.org/r1689559 ]

          SOLR-7441: Disable failing test

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1689559 from Joel Bernstein in branch 'dev/trunk' [ https://svn.apache.org/r1689559 ] SOLR-7441 : Disable failing test
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1692193 from Joel Bernstein in branch 'dev/trunk'
          [ https://svn.apache.org/r1692193 ]

          SOLR-7441:Improve overall robustness of the Streaming stack: Streaming API, Streaming Expressions, Parallel SQL

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1692193 from Joel Bernstein in branch 'dev/trunk' [ https://svn.apache.org/r1692193 ] SOLR-7441 :Improve overall robustness of the Streaming stack: Streaming API, Streaming Expressions, Parallel SQL
          Hide
          gus_heck Gus Heck added a comment -

          Joel Bernstein thx for reviewing this in and carrying it forward. I got behind on my list mail and didn't see your comments. Sorry!

          Show
          gus_heck Gus Heck added a comment - Joel Bernstein thx for reviewing this in and carrying it forward. I got behind on my list mail and didn't see your comments. Sorry!
          Hide
          gus_heck Gus Heck added a comment -

          Also, if you or anyone reading this has the karma, please disable the gus.heck@olin.edu account. Mail to it (and thus the inline references you used here) don't actually reach me. I haven't worked there since 2004, and there is no way for me to regain control of that account using that address.

          Show
          gus_heck Gus Heck added a comment - Also, if you or anyone reading this has the karma, please disable the gus.heck@olin.edu account. Mail to it (and thus the inline references you used here) don't actually reach me. I haven't worked there since 2004, and there is no way for me to regain control of that account using that address.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 1694909 from Joel Bernstein in branch 'dev/trunk'
          [ https://svn.apache.org/r1694909 ]

          SOLR-7441: Updated CHANGES.txt

          Show
          jira-bot ASF subversion and git services added a comment - Commit 1694909 from Joel Bernstein in branch 'dev/trunk' [ https://svn.apache.org/r1694909 ] SOLR-7441 : Updated CHANGES.txt
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Release with Solr 6

          Show
          joel.bernstein Joel Bernstein added a comment - Release with Solr 6

            People

            • Assignee:
              joel.bernstein Joel Bernstein
              Reporter:
              erickerickson Erick Erickson
            • Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development