Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Many RPCs respond with RespondSuccess(), which returns the exact same payload every time. This takes the same path as any other response, ultimately calling Connection::QueueResponseForCall(), which performs a few small heap allocations. These small allocations (and their corresponding deallocations) happen very frequently (once for every IncomingCall) and end up taking considerable time in tcmalloc and the kernel (traversing the free list, spin locks, etc.).
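To make the cost concrete, here is a simplified sketch of the per-response allocation pattern described above. This is illustrative only, not the actual kudu::rpc code: every response builds a fresh vector of slices and a fresh transfer task on the heap, so each RPC pays an allocate/free round trip per object on the hot path.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the real kudu::Slice and
// kudu::rpc::OutboundTransferTask types.
struct Slice {
  const char* data;
  size_t size;
};

struct OutboundTransferTask {
  std::vector<Slice> slices;
};

// Hypothetical analogue of Connection::QueueResponseForCall(): one heap
// allocation for the slices vector's backing store, another for the task.
// Both are freed again once the write completes, once per IncomingCall.
OutboundTransferTask* QueueResponseForCall(const std::string& payload) {
  std::vector<Slice> slices;  // backing store allocated on first push_back
  slices.push_back({payload.data(), payload.size()});
  return new OutboundTransferTask{std::move(slices)};  // second allocation
}
```

Under load, these short-lived allocations churn the tcmalloc thread caches and spill into the central free list, which is exactly what the stacks below show.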
This was found when mmokhtar ran profiles on Impala over KRPC on a 20-node cluster. The exact percentage of time spent is hard to quantify from the profiles, but these were among the top 5 slowest stacks:
impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
impalad ! operator delete + 0x329 - [unknown source file]
impalad ! __gnu_cxx::new_allocator<kudu::Slice>::deallocate + 0x4 - new_allocator.h:110
impalad ! std::_Vector_base<kudu::Slice, std::allocator<kudu::Slice>>::_M_deallocate + 0x5 - stl_vector.h:178
impalad ! ~_Vector_base + 0x4 - stl_vector.h:160
impalad ! ~vector - stl_vector.h:425   <---- Deleting 'slices' vector
impalad ! kudu::rpc::Connection::QueueResponseForCall + 0xac - connection.cc:433
impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
..
impalad ! tcmalloc::CentralFreeList::FetchFromOneSpans - [unknown source file]
impalad ! tcmalloc::CentralFreeList::RemoveRange + 0xc0 - [unknown source file]
impalad ! tcmalloc::ThreadCache::FetchFromCentralCache + 0x62 - [unknown source file]
impalad ! operator new + 0x297 - [unknown source file]   <--- Creating new 'OutboundTransferTask' object
impalad ! kudu::rpc::Connection::QueueResponseForCall + 0x76 - connection.cc:432
impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
...
Even creating and deleting the 'RpcContext' takes a lot of time:
impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
impalad ! operator delete + 0x329 - [unknown source file]
impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x16 - impala_internal_service.pb.cc:1221
impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x8 - impala_internal_service.pb.cc:1222
impalad ! kudu::DefaultDeleter<google::protobuf::Message>::operator() + 0x5 - gscoped_ptr.h:145
impalad ! ~gscoped_ptr_impl + 0x9 - gscoped_ptr.h:228
impalad ! ~gscoped_ptr - gscoped_ptr.h:318
impalad ! kudu::rpc::RpcContext::~RpcContext + 0x1e - rpc_context.cc:53   <-----
impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1ff - rpc_context.cc:67
The above shows that creating and destroying these small objects under moderately heavy load results in heavy contention in tcmalloc and the kernel. We would benefit significantly from a fast path for 'RespondSuccess'.
My suggestion is to allocate all of these small objects up front, while the 'InboundCall' object is being created, in a 'RespondSuccess' structure, and use that structure whenever we want to send 'success' back to the sender. It would already contain the 'OutboundTransferTask', a 'Slice' with 'success', etc. We can expect that most RPCs respond with 'success' a majority of the time.
The benefit is that we no longer allocate and deallocate memory for these small objects on every response; instead, we do it all at once while creating the 'InboundCall' object.
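The idea above can be sketched roughly as follows. All names here ('success_response_', 'RespondSuccess', the simplified Slice and OutboundTransferTask types) are hypothetical stand-ins, not the actual KRPC API: the point is only that the response objects are allocated once at call-construction time, leaving the hot path allocation-free.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins for the real kudu::Slice and
// kudu::rpc::OutboundTransferTask types.
struct Slice {
  const char* data;
  size_t size;
};

struct OutboundTransferTask {
  std::vector<Slice> payload_slices;  // reused for the call's lifetime
};

// Hypothetical InboundCall that preallocates its "success" response.
class InboundCall {
 public:
  InboundCall() {
    // One up-front allocation batch at call setup time, off the hot path.
    success_response_ = std::make_unique<OutboundTransferTask>();
    success_response_->payload_slices.reserve(2);  // e.g. header + body
  }

  // Fast path: no heap allocation here; we only fill in the preallocated
  // task. reserve() above guarantees push_back() does not reallocate.
  OutboundTransferTask* RespondSuccess(const std::string& body) {
    success_response_->payload_slices.clear();
    success_response_->payload_slices.push_back({body.data(), body.size()});
    return success_response_.get();
  }

 private:
  std::unique_ptr<OutboundTransferTask> success_response_;
};
```

Repeated calls hand back the same preallocated object, so the allocate/free churn seen in the profiles disappears from the response path; the error path can keep the existing slow path unchanged.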
I just wanted to start a discussion about this, so even if what I suggested seems a little off, hopefully we can move forward with this on some level.
Attachments
Issue Links
- is related to
  - IMPALA-5135 KRPC : ReportExecStatus RPC can timeout when deserializing large query profiles due to tcmalloc contention (Resolved)
  - IMPALA-5528 tcmalloc contention much higher with concurrency after KRPC patch (Resolved)