Details

    • Type: Bug
    • Status: In Progress
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: Impala 2.2.4
    • Fix Version/s: None
    • Component/s: Distributed Exec
    • Labels:

      Description

      Impala relies on Thrift 0.9.X for its RPC implementation. This is a venerable release and is showing its age. Problems include:

      • Low-quality SASL support (that we implemented ourselves)
      • Lack of a high-quality non-blocking server (TNonblockingServer is usable, but doesn't work with SASL, making it a non-starter for us; it is also very hard to support sessions with it).
      • Lack of a zero-copy native bytes type, meaning that large data structures are inefficient to send (see the illustrative sketch after this list).
      • Lack of support for an asynchronous server implementation, so expensive RPCs can consume threads that could have been used by cheap ones.
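
      To make the zero-copy point concrete, here is a minimal, hypothetical C++ sketch; the SerializedRowBatch struct and buffer sizes are invented for illustration and are not Impala's actual Thrift definitions. It shows why a Thrift 'binary' field (which maps to std::string in generated C++) forces an extra copy of a large payload before anything reaches the wire:

      #include <cstdint>
      #include <string>
      #include <vector>

      // Hypothetical stand-in for a Thrift-generated struct with a 'binary' field.
      struct SerializedRowBatch {
        std::string tuple_data;  // Thrift 'binary' becomes std::string in C++.
      };

      int main() {
        // Pretend this is a large, already self-contained row batch buffer.
        std::vector<uint8_t> row_batch(64 * 1024 * 1024, 0xAB);

        SerializedRowBatch msg;
        // Copy #1: raw buffer -> struct field, required before serialization.
        msg.tuple_data.assign(reinterpret_cast<const char*>(row_batch.data()),
                              row_batch.size());

        // Copy #2 (not shown): the protocol/transport copies tuple_data again
        // while framing the message. A zero-copy bytes type would instead let
        // the transport write row_batch.data() directly onto the wire.
        return 0;
      }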

      Both Kudu and fbthrift have RPC implementations that address some or all of these shortcomings. We should evaluate both and commit to moving our RPC stack to one or the other.

        Issue Links

        1. Add Protobuf 2.6.1 to toolchain and as a build dependency (Sub-task, Resolved, Henry Robinson)
        2. Add libev 4.2.0 to toolchain and as a build dependency (Sub-task, Resolved, Henry Robinson)
        3. Add crcutil to toolchain (Sub-task, Resolved, Henry Robinson)
        4. Add Kudu's RPC, util and security libraries (Sub-task, Resolved, Henry Robinson)
        5. Add Kerberos minicluster test framework (Sub-task, In Progress, Sailesh Mukil)
        6. Add RpcMgr to interface between Impala and KRPC library (Sub-task, Resolved, Michael Ho)
        7. Upgrade gutil to recent Kudu version (Sub-task, Resolved, Henry Robinson)
        8. Port datastream portions of ImpalaInternalService to KRPC (Sub-task, Open, Michael Ho)
        9. Increase maximum KRPC message size (Sub-task, Open, Michael Ho)
        10. Replace kudu::ServicePool with one that uses Impala threads (Sub-task, In Progress, Sailesh Mukil)
        11. Refactor CreateImpalaServer() to allow it to be used in tests (Sub-task, Resolved, Sailesh Mukil)
        12. Remove per-RPC DNS lookup (Sub-task, Resolved, Michael Ho)
        13. Upgrade LZ4 to recent version (Sub-task, Resolved, Henry Robinson)
        14. Enable KRPC Kerberos support in Impala (Sub-task, In Progress, Sailesh Mukil)
        15. Enable KRPC TLS in Impala (Sub-task, In Progress, Sailesh Mukil)
        16. Suppress kudu flags that aren't relevant to Impala (Sub-task, Resolved, Sailesh Mukil)
        17. Upgrade glog and gflags to most recent releases (Sub-task, Resolved, Henry Robinson)
        18. KRPC DCHECK hit when closing DataStreamRecvr (Sub-task, Resolved, Henry Robinson)
        19. Don't call CancelInternal() from Coordinator::UpdateFragmentExecStatus() (Sub-task, Open, Joe McDonnell)
        20. Add Protobuf headers to Impala-lzo (Sub-task, Resolved, Michael Ho)
        21. Add krb5 to toolchain (Sub-task, Resolved, Henry Robinson)
        22. Openssl 1.0.0 shared library support for legacy platform (Sub-task, Resolved, Michael Ho)
        23. tcmalloc contention much higher with concurrency after KRPC patch (Sub-task, Open, Michael Ho)
        24. Concurrent hung with lots of fragments blocked in KrpcDataStreamSender::Channel::WaitForRpc(std::unique_lock (Sub-task, Open, Michael Ho)
        25. Coordinator should timeout a connection for an unresponsive backend (Sub-task, Open, Sailesh Mukil)
        26. RowBatch is serialized once per destination channel for Broadcast exchange (Sub-task, In Progress, Michael Ho)
        27. Queries make very slow progress and report WaitForRPC() stuck for too long (Sub-task, Open, Sailesh Mukil)
        28. Address log spew originating from InboundCall::Respond() (Sub-task, Resolved, Sailesh Mukil)

          Activity

          kwho Michael Ho added a comment -

          IMPALA-2990 may occur more frequently once impalad is able to run at a larger scale with KRPC. The coordinator can easily be overwhelmed by status updates and runtime filter propagation. An execution thread can easily decide to cancel all fragment instances on a particular backend after failing to report status to the overloaded coordinator three times. The coordinator will not be aware of such local cancellation, and the remaining fragment instances may hang waiting for more data from a data stream sender without realizing that the sender's fragment was already cancelled.

          henryr Henry Robinson added a comment -

          ryan zapanta - glad to hear it! I realise I haven't linked the branch that we're working on from this ticket. It's at https://github.com/henryr/Impala/commits/krpc, and we're pretty far along - Impala passes all tests running over KRPC, and can run TPCH and TPCDS reliably on some medium-sized clusters.

          ryanz_impala_dcf5 ryan zapanta added a comment -

          This fix would help tremendously at Rockstar Games.

          fish0515_impala_49b1 fishing added a comment -

          I don't have one in mind right now.

          henryr Henry Robinson added a comment -

          Yes, it's similar to Flatbuffers etc. Did you have a use case in mind?

          fish0515_impala_49b1 fishing added a comment -

          Have you heard of https://capnproto.org/ ?

          fish0515_impala_49b1 fishing added a comment -


          In the short term, I think the smallest change is to add a ThransmiteDataServer; the other servers should be kept as they are. What do you think?

          fish0515_impala_49b1 fishing added a comment -

          Sorry, the ThransmiteDataServer has not been implemented yet; we have just switched the RPC from Apache Thrift to fbthrift, using cpp1, which is very similar to Apache Thrift, and performance did not improve.

          henryr Henry Robinson added a comment -

          That sounds great! How was the performance? Did you see a measurable improvement?

          fish0515_impala_49b1 fishing added a comment -

          If you want, we could share some code with you.

          fish0515_impala_49b1 fishing added a comment -

          Hi Henry,
          We have implemented part of the Impala RPC using fbthrift. We added a ThransmiteDataServer (using cpp2); the other services use fbthrift (cpp), which is compatible with Apache Thrift.

          Our online business is now running on fbthrift (cpp), and we are currently developing the ThransmiteDataServer with fbthrift (cpp2).

          henryr Henry Robinson added a comment -

          Hi fishing!

          I agree fully, there's a severe need to rebuild Impala's RPC stack from the ground up. Of course, this is quite a big undertaking...

          Here, as I see it right now, are the requirements for the next iteration of RPC in Impala:

          • Event-based multiplexed send / recv - this is a no-brainer, but Impala's RPC right now uses the ancient thread-per-connection model, and scales very poorly as a result. Although Thrift 0.9.X includes a non-blocking server, we didn't find it to be very stable, or compatible with our security requirements. Any RPC implementation must use a fixed number of threads for handling a variable number of RPCs in flight.
          • Security features: SASL, Kerberos, SSL - any RPC implementation must allow us to speak Kerberos (preferably over SASL), and support SSL. I would prefer this to be native support, since we've found that maintaining a separate SASL transport for Thrift is difficult and prone to bugs and performance problems.
          • Efficient handling of large RPC payloads - in particular, zero-copy after deserialization, and 'raw' payloads. By 'raw', I mean the ability to send bytes on the wire without having them serialized into a transport-specific data structure. In Impala this comes up when we want to send row batches in TransmitData() - we convert to a self-contained, relative-addressed data structure, and then have to serialise as a Thrift binary field. One other thing that would be nice to have is streaming, where we don't have to send the total length up-front. This would allow TransmitData() to use effectively a single RPC to send its entire row stream.
          • Asynchronous RPC - we simulate asynchronous RPC by having a fixed set of threads that can send RPCs. Instead, it would be good to write a loop that looks like the following fork-join pattern:
          for (const auto& dest : destinations) {
            DoASyncRpc(dest, payload);
          }

          for (const auto& dest : destinations) {
            WaitOnRPC(dest);
          }

          This would help sending runtime filters, statestore updates, and starting remote fragments, amongst other things. (A fuller, self-contained sketch of this fork-join pattern appears after this list.)

          • Ease of contribution - whatever framework we use may need tailoring to our needs. It would be good to have an RPC stack that we can easily make changes to. This has been a problem with Thrift 0.9.0.
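
          To make the fork-join pattern above a bit more concrete, here is a minimal, self-contained C++ sketch using std::async; SendPayloadTo() and the Status struct are hypothetical stand-ins for illustration, not an Impala or KRPC API. Note that std::async still ties up one thread per in-flight call, so this only illustrates the calling pattern, not the event-based multiplexing described above:

          #include <future>
          #include <string>
          #include <vector>

          struct Status { bool ok = true; };  // Hypothetical status type.

          // Hypothetical blocking send; a real async RPC layer would not tie up
          // a thread while the call is in flight.
          Status SendPayloadTo(const std::string& dest, const std::string& payload) {
            return Status{};
          }

          int main() {
            std::vector<std::string> destinations = {"host1:27000", "host2:27000"};
            std::string payload = "row batch bytes";

            // Fork: start one RPC per destination without waiting on any of them.
            std::vector<std::future<Status>> in_flight;
            for (const auto& dest : destinations) {
              in_flight.push_back(
                  std::async(std::launch::async, SendPayloadTo, dest, payload));
            }

            // Join: wait for every RPC to complete before continuing.
            for (auto& f : in_flight) {
              Status s = f.get();
              (void)s;  // Real code would check and propagate errors here.
            }
            return 0;
          }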

          What framework to use?

          There are a few RPC stacks that are worthy of consideration. I've considered three carefully:

          1. Google RPC
          2. fbthrift
          3. Kudu RPC

          Although fbthrift has a lot of advantages, not least the fact that it's backwards compatible with Thrift, there are some issues that concern me. Chief amongst them is the increased reliance on Boost and Folly - we're trying to move away from the former, and don't really want to add another dependency on the latter. Finally, fbthrift is a Facebook project that's actively maintained, which is awesome, but may not be as easy to contribute to, given how important the stack clearly is to FB's operations.

          Instead, I'm inclined to move towards Kudu RPC. It satisfies most of the features above (although I think we'll need to add SSL). It is also based on an Apache project, which means that contribution / code-sharing is easier and encouraged. Some of Kudu's utility code was originally based on Impala's, so there's a really good opportunity to build a common set of C++ utilities for ASF projects.

          (BTW, we need to keep Thrift in the client-facing APIs for the time being - we can't break clients that were using HS2 by changing the serialisation format.)

          I'm writing a design document for this change, and hope to post it in a couple of weeks for community review.

          Henry

          fish0515_impala_49b1 fishing added a comment -

          We want to redesign the network transmission to use fbthrift, for several advantages: zero-copy, async, and performance.

          fish0515_impala_49b1 fishing added a comment -

          Hi Henry Robinson,

          What is the status of this issue?


            People

            • Assignee: kwho Michael Ho
            • Reporter: henryr Henry Robinson
            • Votes: 5
            • Watchers: 16

              Dates

              • Created:
                Updated:
