  Kudu / KUDU-1865

Create fast path for RespondSuccess() in KRPC


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: perf, rpc

    Description

      A lot of RPCs just respond with RespondSuccess(), which returns the exact same payload every time. This takes the same path as any other response, ultimately calling Connection::QueueResponseForCall(), which makes a few small allocations. These small allocations (and their corresponding deallocations) happen very frequently (once for every IncomingCall) and end up taking quite some time in the memory allocator (traversing the free list, taking spin locks, etc.).
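
      To make the pattern concrete, here is a minimal sketch of the kind of per-call heap traffic the stacks below point at. This is not the actual Kudu code; 'Slice' and 'OutboundTransferTask' are simplified stand-ins and 'QueueResponseForCallSketch' is an illustrative name only:

      #include <cstddef>
      #include <cstdint>
      #include <memory>
      #include <vector>

      struct Slice {                      // stand-in for kudu::Slice
        const uint8_t* data = nullptr;
        size_t size = 0;
      };

      struct OutboundTransferTask {       // stand-in for the real transfer object
        std::vector<Slice> payload;
      };

      // Roughly the shape of the response path seen in the stacks: every call
      // builds a fresh slices vector and a fresh heap-allocated transfer task,
      // and both are freed again as soon as the response has been written.
      void QueueResponseForCallSketch(const std::vector<Slice>& response_slices) {
        std::vector<Slice> slices = response_slices;               // small alloc
        auto transfer = std::make_unique<OutboundTransferTask>();  // small alloc
        transfer->payload = std::move(slices);
        // ... hand the transfer to the reactor thread for writing ...
      }                                                            // matching frees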

      This was found when mmokhtar ran profiles of Impala over KRPC on a 20-node cluster.

      The exact % of time spent is hard to quantify from the profiles, but these were among the top 5 slowest stacks:

      impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
      impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
      impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
      impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
      impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
      impalad ! operator delete + 0x329 - [unknown source file]
      impalad ! __gnu_cxx::new_allocator<kudu::Slice>::deallocate + 0x4 - new_allocator.h:110
      impalad ! std::_Vector_base<kudu::Slice, std::allocator<kudu::Slice>>::_M_deallocate + 0x5 - stl_vector.h:178
      impalad ! ~_Vector_base + 0x4 - stl_vector.h:160
      impalad ! ~vector - stl_vector.h:425                           <----Deleting 'slices' vector
      impalad ! kudu::rpc::Connection::QueueResponseForCall + 0xac - connection.cc:433
      impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
      impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
      impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
      ..
      
      impalad ! tcmalloc::CentralFreeList::FetchFromOneSpans - [unknown source file]
      impalad ! tcmalloc::CentralFreeList::RemoveRange + 0xc0 - [unknown source file]
      impalad ! tcmalloc::ThreadCache::FetchFromCentralCache + 0x62 - [unknown source file]
      impalad ! operator new + 0x297 - [unknown source file]        <--- Creating new 'OutboundTransferTask' object.
      impalad ! kudu::rpc::Connection::QueueResponseForCall + 0x76 - connection.cc:432
      impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
      impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
      impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
      ...
      

      Even creating and deleting the 'RpcContext' takes a lot of time:

      impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
      impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown source file]
      impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source file]
      impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown source file]
      impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
      impalad ! operator delete + 0x329 - [unknown source file]
      impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x16 - impala_internal_service.pb.cc:1221
      impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x8 - impala_internal_service.pb.cc:1222
      impalad ! kudu::DefaultDeleter<google::protobuf::Message>::operator() + 0x5 - gscoped_ptr.h:145
      impalad ! ~gscoped_ptr_impl + 0x9 - gscoped_ptr.h:228
      impalad ! ~gscoped_ptr - gscoped_ptr.h:318
      impalad ! kudu::rpc::RpcContext::~RpcContext + 0x1e - rpc_context.cc:53   <-----
      impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1ff - rpc_context.cc:67
      

      The above stacks show that allocating and freeing these small objects under moderately heavy load results in heavy contention in the allocator. We would benefit a lot if we created a fast path for 'RespondSuccess'.

      My suggestion is to create all of these small objects up front, in a 'RespondSuccess' structure allocated alongside the 'InboundCall' object when it is created, and to reuse that structure whenever we want to send 'success' back to the sender. It would already contain the 'OutboundTransferTask', a 'Slice' with 'success', etc. We can expect that most RPCs respond with 'success' a majority of the time.

      The benefit is that we no longer go back and forth allocating and deallocating memory for these small objects on every call; instead, we do it all at once when the 'InboundCall' object is created (see the sketch below).
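
      As a strawman, here is a minimal sketch of that idea. Everything in it ('PreallocatedSuccess', 'RespondSuccessFast', and the simplified 'Slice' and 'OutboundTransferTask') is hypothetical naming rather than existing Kudu API; the point is only that the small objects are allocated once, when the 'InboundCall' is created, and reused on the success path:

      #include <cstddef>
      #include <cstdint>
      #include <memory>
      #include <vector>

      struct Slice {                      // stand-in for kudu::Slice
        const uint8_t* data = nullptr;
        size_t size = 0;
      };

      struct OutboundTransferTask {       // stand-in for the real transfer object
        std::vector<Slice> payload;
      };

      // Hypothetical bundle of the small objects needed for a canned 'success' reply.
      struct PreallocatedSuccess {
        std::unique_ptr<OutboundTransferTask> transfer;  // reused, never re-newed
      };

      class InboundCall {
       public:
        InboundCall() {
          // Pay the allocation cost once, up front, instead of on every response.
          success_.transfer = std::make_unique<OutboundTransferTask>();
          success_.transfer->payload.reserve(2);         // header + "success" body
        }

        // Fast path: reuse the pre-built transfer; no new/delete on the hot path
        // as long as the payload stays within the reserved capacity.
        void RespondSuccessFast(const Slice& header, const Slice& body) {
          auto& payload = success_.transfer->payload;
          payload.clear();                               // keeps its capacity
          payload.push_back(header);
          payload.push_back(body);
          // ... enqueue success_.transfer on the connection's reactor thread, but
          // do not delete it afterwards; ownership stays with the InboundCall ...
        }

       private:
        PreallocatedSuccess success_;
      };

      Error responses, sidecars, etc. could keep the existing path, so the extra memory held per 'InboundCall' would stay small.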

      I just wanted to start a discussion about this, so even if what I suggested seems a little off, hopefully we can move forward with this on some level.

      Attachments

        1. cross-thread.txt (7 kB) - Todd Lipcon
        2. alloc-pattern.py (0.9 kB) - Todd Lipcon


            People

              Assignee: Unassigned
              Reporter: Sailesh Mukil
              Votes: 0
              Watchers: 11
