Uploaded image for project: 'Calcite'
  1. Calcite
  2. CALCITE-903

Enable client to recover from missing server-side state

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: avatica
    • Labels:
      None

      Description

      When deploying more than one instance of an avatica-server, we have the desire to treat the collection of servers as a single server. Ideally, we want to have the avatica-client operate in a manner that doesn't expect a server to have specific state For example, the avatica-client should be able to know that when a server doesn't have a statement with the ID the client thinks it should, the client can create a new statement.

      This is desirable because it allows us to use a generic load-balancer between clients and servers without the need for clustering or sticky sessions. The downside is that in the face of failure, operations will take longer than normal. Even with the performance hit, as long as an avatica-server exists, the client can still retrieve the results for some query which is ideal (tl;dr it will take longer, but the client still gets the answer).

      Two major areas that need to be addressed presently are:

      1. Automatic recreation of Statements when they are not cached
      2. Recreation of ResultSets to resume iteration (for fetch()). This depends on "stable results" by the underlying JDBC driver, otherwise external synchronization would be required. This is considered a prerequisite.

        Issue Links

          Activity

          Hide
          jcamachorodriguez Jesus Camacho Rodriguez added a comment -

          Resolved in release 1.5.0 (2015-11-10)

          Show
          jcamachorodriguez Jesus Camacho Rodriguez added a comment - Resolved in release 1.5.0 (2015-11-10)
          Hide
          elserj Josh Elser added a comment -

          Yay, thanks again, Julian!

          Show
          elserj Josh Elser added a comment - Yay, thanks again, Julian!
          Show
          julianhyde Julian Hyde added a comment - Fixed in http://git-wip-us.apache.org/repos/asf/calcite/commit/97df1acb .
          Hide
          julianhyde Julian Hyde added a comment -

          Tests are failing for me with 3 checkstyle exceptions. Please check those out.

          Show
          julianhyde Julian Hyde added a comment - Tests are failing for me with 3 checkstyle exceptions. Please check those out.
          Hide
          elserj Josh Elser added a comment -

          Prelim-patch (PR) incoming once I get the existing unit tests passing. Of technologically cool note: I started two phoenix queryserver instances and put them behind HAProxy and started doing some local tests. Running a query via sqlline that pulled by 25k results, I killed one PQS, and then the client automatically started pulling the rest of the results from the other PQS. I didn't do any more correctness testing other than sqlline reporting 25k results, but I was pleased to see it.

          Example haproxy configuration I used, pointing sqlline at localhost:8888. balance source was the big win which pushes a client to a specific server as long as the state of servers doesn't change. This helps ensure that we don't bounce around to servers recreating state everywhere (as a round-robin approach would likely spin indefinitely).

          global
            maxconn 256
          
          defaults
            mode http
            timeout connect 5000ms
            timeout client 50000ms
            timeout server 50000ms
          
          frontend http-in
            bind  *:8888
            default_backend servers
          
          backend servers
            balance source
            server server1 127.0.0.1:8765 check
            server server2 127.0.0.1:8766 check
          
          Show
          elserj Josh Elser added a comment - Prelim-patch (PR) incoming once I get the existing unit tests passing. Of technologically cool note: I started two phoenix queryserver instances and put them behind HAProxy and started doing some local tests. Running a query via sqlline that pulled by 25k results, I killed one PQS, and then the client automatically started pulling the rest of the results from the other PQS. I didn't do any more correctness testing other than sqlline reporting 25k results, but I was pleased to see it. Example haproxy configuration I used, pointing sqlline at localhost:8888. balance source was the big win which pushes a client to a specific server as long as the state of servers doesn't change. This helps ensure that we don't bounce around to servers recreating state everywhere (as a round-robin approach would likely spin indefinitely). global maxconn 256 defaults mode http timeout connect 5000ms timeout client 50000ms timeout server 50000ms frontend http-in bind *:8888 default_backend servers backend servers balance source server server1 127.0.0.1:8765 check server server2 127.0.0.1:8766 check
          Hide
          elserj Josh Elser added a comment -

          Note that I fixed CALCITE-843 yesterday, which touched the logic to synchronize session state.

          Ok, thanks for the pointer!

          Show
          elserj Josh Elser added a comment - Note that I fixed CALCITE-843 yesterday, which touched the logic to synchronize session state. Ok, thanks for the pointer!
          Hide
          julianhyde Julian Hyde added a comment -

          Sounds good. Note that I fixed CALCITE-843 yesterday, which touched the logic to synchronize session state.

          Show
          julianhyde Julian Hyde added a comment - Sounds good. Note that I fixed CALCITE-843 yesterday, which touched the logic to synchronize session state.
          Hide
          elserj Josh Elser added a comment -

          High-level before I get a patch put up here... I think my approach would fall into your 3rd point. Client can recover when a server responds in a certain way (because it doesn't have the necessary state). I liked this approach because it let me reuse a bunch of the existing RPC logic instead of trying to re-hash how Fetch fundamentally works. On the client it's more or less:

          while(true) {
            try {
              fetch();
              return;
            } catch (missing_state_1) {
              reset_necessary_state1();
              continue;
            } catch (missing_state_2) {
              reset_necessary_state2();
              continue;
            }
          }
          
          • Introduced a new attribute sent back to the client to denote that the requested statement doesn't exist and that the client should create a new Statement.
          • Introduced a new attribute sent back to the client to denote that the requests ResultSet (on a Statement) doesn't exist and the client needs to "recreate" that ResultSet on the server.
          • Introduced a new RPC method to recreate the ResultSet in the server (either created by a DatabaseMetaData operation or some SQL) that a client can call when FetchResponse contains a new attribute (said attribute is essentially a marker for the client to know that it should call this new RPC method). Aside: I think there might be a way to actually work around a new RPC method, but we can chat that over when I post some code
          Show
          elserj Josh Elser added a comment - High-level before I get a patch put up here... I think my approach would fall into your 3rd point. Client can recover when a server responds in a certain way (because it doesn't have the necessary state). I liked this approach because it let me reuse a bunch of the existing RPC logic instead of trying to re-hash how Fetch fundamentally works. On the client it's more or less: while ( true ) { try { fetch(); return ; } catch (missing_state_1) { reset_necessary_state1(); continue ; } catch (missing_state_2) { reset_necessary_state2(); continue ; } } Introduced a new attribute sent back to the client to denote that the requested statement doesn't exist and that the client should create a new Statement. Introduced a new attribute sent back to the client to denote that the requests ResultSet (on a Statement) doesn't exist and the client needs to "recreate" that ResultSet on the server. Introduced a new RPC method to recreate the ResultSet in the server (either created by a DatabaseMetaData operation or some SQL) that a client can call when FetchResponse contains a new attribute (said attribute is essentially a marker for the client to know that it should call this new RPC method). Aside: I think there might be a way to actually work around a new RPC method, but we can chat that over when I post some code
          Hide
          julianhyde Julian Hyde added a comment -

          Is the idea that the client will ALWAYS send the full session state? Or to have a mode where the client sends the full session state? Or to have a recovery process after which the state has been re-created on another session server and the client can go back to sending just connection and statement IDs?

          Show
          julianhyde Julian Hyde added a comment - Is the idea that the client will ALWAYS send the full session state? Or to have a mode where the client sends the full session state? Or to have a recovery process after which the state has been re-created on another session server and the client can go back to sending just connection and statement IDs?
          Hide
          elserj Josh Elser added a comment -

          I've been hacking on a prototype to enable this by starting two PQS instances, and making a dirty hack to the client to randomly choose one server or the other. The expected outcome is that the client should see the same results, regardless of which PQS instance actually returned the results. This goes as far to have PQS1 return the first frame, and PQS2 returns the rest of the frames. I have some initial queries working, but need to write many more tests before I feel comfortable with it.

          Show
          elserj Josh Elser added a comment - I've been hacking on a prototype to enable this by starting two PQS instances, and making a dirty hack to the client to randomly choose one server or the other. The expected outcome is that the client should see the same results, regardless of which PQS instance actually returned the results. This goes as far to have PQS1 return the first frame, and PQS2 returns the rest of the frames. I have some initial queries working, but need to write many more tests before I feel comfortable with it.

            People

            • Assignee:
              elserj Josh Elser
              Reporter:
              elserj Josh Elser
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development