Details
Improvement
Status: Resolved
Major
Resolution: Fixed
None
Twitter Q2 Sprint 3 - 5/11
5
Description
For large clusters, when many slaves register at the same time, the master gets backlogged processing registration requests. Profiling the master with perf revealed the following:
Events: 14K cycles
 25.44%  libmesos-0.22.0-x.so  [.] mesos::internal::master::Master::registerSlave(process::UPID const&, mesos::SlaveInfo const&, std::vector<mesos::Resource, std::allocator<mesos::Resource> > cons
 11.18%  libmesos-0.22.0-x.so  [.] pipecb
  5.88%  libc-2.5.so           [.] malloc_consolidate
  5.33%  libc-2.5.so           [.] _int_free
  5.25%  libc-2.5.so           [.] malloc
  5.23%  libc-2.5.so           [.] _int_malloc
  4.11%  libstdc++.so.6.0.8    [.] std::string::assign(std::string const&)
  3.22%  libmesos-0.22.0-x.so  [.] mesos::Resource::SharedDtor()
  3.10%  [kernel]              [k] _raw_spin_lock
  1.97%  libmesos-0.22.0-x.so  [.] mesos::Attribute::SharedDtor()
  1.28%  libc-2.5.so           [.] memcmp
  1.08%  libc-2.5.so           [.] free
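This is likely because we loop over all the registered slaves for each registration, so the cost of a single registration grows linearly with the number of slaves already registered: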
void Master::registerSlave(
    const UPID& from,
    const SlaveInfo& slaveInfo,
    const vector<Resource>& checkpointedResources,
    const string& version)
{
  // ...
  // Check if this slave is already registered (because it retries).
  foreachvalue (Slave* slave, slaves.registered) {
    if (slave->pid == from) {
      // ...
    }
  }
  // ...
}
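One possible way to avoid the linear scan is to keep a secondary index keyed by the slave's pid, so the "already registered" check becomes a hash lookup. The following is only a minimal sketch of that idea, not the change that was actually made in Mesos: `SlaveRegistry` and its methods are hypothetical names, `Slave` is a simplified stand-in for the real struct, and the pid is modelled as a plain `std::string` rather than `process::UPID`.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Simplified stand-in for the master's Slave struct (assumption: only
// the fields needed for this illustration are modelled).
struct Slave {
  std::string id;
  std::string pid;  // Stand-in for process::UPID.
};

// Hypothetical registry that indexes registered slaves by pid.
class SlaveRegistry {
public:
  // Adds the slave to the index. Returns false if a slave with the
  // same pid is already present (e.g. a retried registration).
  bool add(Slave* slave) {
    return byPid.emplace(slave->pid, slave).second;
  }

  // Removes the slave from the index, if present.
  void remove(const Slave& slave) { byPid.erase(slave.pid); }

  // Average O(1) lookup; returns nullptr if no slave has this pid.
  Slave* find(const std::string& pid) const {
    auto it = byPid.find(pid);
    return it == byPid.end() ? nullptr : it->second;
  }

  std::size_t size() const { return byPid.size(); }

private:
  std::unordered_map<std::string, Slave*> byPid;
};
```

With an index like this, the check in registerSlave would reduce to a single `find(from)` call instead of a loop over every registered slave. The trade-off is that the index has to be kept in sync wherever slaves are added or removed.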