  Kudu / KUDU-1865

Create fast path for RespondSuccess() in KRPC


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: perf, rpc

    Description

      A lot of RPCs just respond with RespondSuccess(), which returns the exact same payload every time. This takes the same path as any other response, ultimately calling Connection::QueueResponseForCall(), which makes a few small allocations. These small allocations (and their corresponding deallocations) happen very frequently (once for every IncomingCall) and end up taking quite some time in the memory allocator (traversing the free list, taking spin locks, etc.).
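
      To make the pattern concrete, here is a minimal sketch of the kind of per-call heap traffic the stacks below point at. This is not the actual Kudu code; 'Slice' and 'OutboundTransferTask' are simplified stand-ins and 'QueueResponseForCallSketch' is an illustrative name only:

      #include <cstddef>
      #include <cstdint>
      #include <memory>
      #include <vector>

      struct Slice {                      // stand-in for kudu::Slice
        const uint8_t* data = nullptr;
        size_t size = 0;
      };

      struct OutboundTransferTask {       // stand-in for the real transfer object
        std::vector<Slice> payload;
      };

      // Roughly the shape of the response path seen in the stacks: every call
      // builds a fresh slices vector and a fresh heap-allocated transfer task,
      // and both are freed again as soon as the response has been written.
      void QueueResponseForCallSketch(const std::vector<Slice>& response_slices) {
        std::vector<Slice> slices = response_slices;               // small alloc
        auto transfer = std::make_unique<OutboundTransferTask>();  // small alloc
        transfer->payload = std::move(slices);
        // ... hand the transfer to the reactor thread for writing ...
      }                                                            // matching frees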

      This was found when mmokhtar ran profiles of Impala over KRPC on a 20-node cluster.

      The exact % of time spent is hard to quantify from the profiles, but these were among the top 5 slowest stacks:

      impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
      impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
      impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
      impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
      impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
      impalad ! operator delete + 0x329 - [unknown source file]
      impalad ! __gnu_cxx::new_allocator<kudu::Slice>::deallocate + 0x4 - new_allocator.h:110
      impalad ! std::_Vector_base<kudu::Slice, std::allocator<kudu::Slice>>::_M_deallocate + 0x5 - stl_vector.h:178
      impalad ! ~_Vector_base + 0x4 - stl_vector.h:160
      impalad ! ~vector - stl_vector.h:425                           <----Deleting 'slices' vector
      impalad ! kudu::rpc::Connection::QueueResponseForCall + 0xac - connection.cc:433
      impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
      impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
      impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
      ..
      
      impalad ! tcmalloc::CentralFreeList::FetchFromOneSpans - [unknown source file]
      impalad ! tcmalloc::CentralFreeList::RemoveRange + 0xc0 - [unknown source file]
      impalad ! tcmalloc::ThreadCache::FetchFromCentralCache + 0x62 - [unknown source file]
      impalad ! operator new + 0x297 - [unknown source file]        <--- Creating new 'OutboundTransferTask' object.
      impalad ! kudu::rpc::Connection::QueueResponseForCall + 0x76 - connection.cc:432
      impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
      impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
      impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
      ...
      

      Even creating and deleting the 'RpcContext' takes a lot of time:

      impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
      impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
      impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
      impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
      impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
      impalad ! operator delete + 0x329 - [unknown source file]
      impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x16 - impala_internal_service.pb.cc:1221
      impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x8 - impala_internal_service.pb.cc:1222
      impalad ! kudu::DefaultDeleter<google::protobuf::Message>::operator() + 0x5 - gscoped_ptr.h:145
      impalad ! ~gscoped_ptr_impl + 0x9 - gscoped_ptr.h:228
      impalad ! ~gscoped_ptr - gscoped_ptr.h:318
      impalad ! kudu::rpc::RpcContext::~RpcContext + 0x1e - rpc_context.cc:53   <-----
      impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1ff - rpc_context.cc:67
      

      The above stacks show that allocating and freeing these small objects under moderately heavy load results in heavy contention in the allocator. We would benefit a lot if we created a fast path for 'RespondSuccess'.

      My suggestion is to create all of these small objects up front, in a 'RespondSuccess' structure allocated alongside the 'InboundCall' object when it is created, and to reuse that structure whenever we want to send 'success' back to the sender. It would already contain the 'OutboundTransferTask', a 'Slice' with 'success', etc. We can expect that most RPCs respond with 'success' a majority of the time.

      The benefit is that we no longer go back and forth allocating and deallocating memory for these small objects on every call; instead, we do it all at once when the 'InboundCall' object is created (see the sketch below).
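
      As a strawman, here is a minimal sketch of that idea. Everything in it ('PreallocatedSuccess', 'RespondSuccessFast', and the simplified 'Slice' and 'OutboundTransferTask') is hypothetical naming rather than existing Kudu API; the point is only that the small objects are allocated once, when the 'InboundCall' is created, and reused on the success path:

      #include <cstddef>
      #include <cstdint>
      #include <memory>
      #include <vector>

      struct Slice {                      // stand-in for kudu::Slice
        const uint8_t* data = nullptr;
        size_t size = 0;
      };

      struct OutboundTransferTask {       // stand-in for the real transfer object
        std::vector<Slice> payload;
      };

      // Hypothetical bundle of the small objects needed for a canned 'success' reply.
      struct PreallocatedSuccess {
        std::unique_ptr<OutboundTransferTask> transfer;  // reused, never re-newed
      };

      class InboundCall {
       public:
        InboundCall() {
          // Pay the allocation cost once, up front, instead of on every response.
          success_.transfer = std::make_unique<OutboundTransferTask>();
          success_.transfer->payload.reserve(2);         // header + "success" body
        }

        // Fast path: reuse the pre-built transfer; no new/delete on the hot path
        // as long as the payload stays within the reserved capacity.
        void RespondSuccessFast(const Slice& header, const Slice& body) {
          auto& payload = success_.transfer->payload;
          payload.clear();                               // keeps its capacity
          payload.push_back(header);
          payload.push_back(body);
          // ... enqueue success_.transfer on the connection's reactor thread, but
          // do not delete it afterwards; ownership stays with the InboundCall ...
        }

       private:
        PreallocatedSuccess success_;
      };

      Error responses, sidecars, etc. could keep the existing path, so the extra memory held per 'InboundCall' would stay small.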

      I just wanted to start a discussion about this, so even if what I suggested seems a little off, hopefully we can move forward with this on some level.

      Attachments

        1. cross-thread.txt (7 kB) - Todd Lipcon
        2. alloc-pattern.py (0.9 kB) - Todd Lipcon


            People

              Assignee: Unassigned
              Reporter: Sailesh Mukil
              Votes: 0
              Watchers: 11
