Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
1.4.0
Sprint: Mesosphere Sprint 67
1
Description
The scheduler library assumes that a connection cannot be interrupted between continuations, for example between send() and _send(): https://github.com/apache/mesos/blob/509a1ab3226bbec7c369f431656f4ec692da00ba/src/scheduler/scheduler.cpp#L553. This assumption does not hold: detected() can fire in between and drop the connection, after which the CHECK_SOME(connections) in _send() fails:
I1107 18:50:57.154796 2138112 scheduler.cpp:496] New master detected at master@192.168.9.40:59063
...
I1107 18:50:57.160935 2138112 scheduler.cpp:505] Waiting for 0ns before initiating a re-(connection) attempt with the master
I1107 18:50:57.161245 1064960 clock.cpp:435] Clock of __collect__(7)@192.168.9.40:59063 updated to 2017-11-07 17:50:57.159954176+00:00
I1107 18:50:57.161285 1898086400 clock.cpp:361] Clock resumed at 2017-11-07 17:50:57.159954176+00:00
I1107 18:50:57.161602 1064960 scheduler.cpp:387] Connected with the master at http://192.168.9.40:59063/master/api/v1/scheduler
I1107 18:50:57.161779 2138112 scheduler.cpp:249] Sending SUBSCRIBE call to http://192.168.9.40:59063/master/api/v1/scheduler
I1107 18:50:57.162037 2138112 scheduler.cpp:496] New master detected at master@192.168.9.40:59063
I1107 18:50:57.162055 2138112 scheduler.cpp:505] Waiting for 0ns before initiating a re-(connection) attempt with the master
I1107 18:50:57.162164 4820992 process.cpp:3167] Dropping event for process __http_connection__(14)@192.168.9.40:59063
F1107 18:50:57.162214 2138112 scheduler.cpp:553] CHECK_SOME(connections): is NONE
*** Check failure stack trace: ***
E1107 18:50:57.162240 4820992 process.cpp:2576] Failed to shutdown socket with fd 9, address 192.168.9.40:59063: Socket is not connected
@ 0x10ed262b4 google::LogMessage::Flush()
@ 0x10ed2a21f google::LogMessageFatal::~LogMessageFatal()
@ 0x10ed26ef9 google::LogMessageFatal::~LogMessageFatal()
E1107 18:50:57.162304 4820992 process.cpp:2576] Failed to shutdown socket with fd 10, address 192.168.9.40:59063: Socket is not connected
@ 0x1078efaea _CheckFatal::~_CheckFatal()
@ 0x1078ea675 _CheckFatal::~_CheckFatal()
@ 0x109dfcabf mesos::v1::scheduler::MesosProcess::_send()
@ 0x109e07438 _ZZN7process8dispatchIN5mesos2v19scheduler12MesosProcessERKNS3_4CallERKNS_6FutureINS_4http7RequestEEES7_SD_EEvRKNS_3PIDIT_EEMSF_FvT0_T1_EOT2_OT3_ENKUlRS5_RSB_PNS_11ProcessBaseEE_clESR_SS_SU_
@ 0x109e072b7 _ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRNS_6__bindIZN7process8dispatchIN5mesos2v19scheduler12MesosProcessERKNS8_4CallERKNS4_6FutureINS4_4http7RequestEEESC_SI_EEvRKNS4_3PIDIT_EEMSK_FvT0_T1_EOT2_OT3_EUlRSA_RSG_PNS4_11ProcessBaseEE_JSC_SI_RNS_12placeholders4__phILi1EEEEEESZ_EEEvDpOT_
@ 0x109e06ba9 _ZNSt3__110__function6__funcINS_6__bindIZN7process8dispatchIN5mesos2v19scheduler12MesosProcessERKNS7_4CallERKNS3_6FutureINS3_4http7RequestEEESB_SH_EEvRKNS3_3PIDIT_EEMSJ_FvT0_T1_EOT2_OT3_EUlRS9_RSF_PNS3_11ProcessBaseEE_JSB_SH_RNS_12placeholders4__phILi1EEEEEENS_9allocatorIS14_EEFvSY_EEclEOSY_
@ 0x10de77d3a std::__1::function<>::operator()()
@ 0x10e307abc process::ProcessBase::visit()
@ 0x10e3b804e process::DispatchEvent::visit()
@ 0x107a4b991 process::ProcessBase::serve()
@ 0x10e300191 process::ProcessManager::resume()
@ 0x10e42d27d process::ProcessManager::init_threads()::$_2::operator()()
@ 0x10e42ce12 _ZNSt3__114__thread_proxyINS_5tupleIJZN7process14ProcessManager12init_threadsEvE3$_2EEEEEPvS6_
@ 0x7fff8591499d _pthread_body
@ 0x7fff8591491a _pthread_start
@ 0x7fff85912351 thread_start
zsh: abort GLOG_v=2 GTEST_FILTER="*SchedulerTest.MasterFailover*" ./bin/mesos-tests.sh
The bug was introduced in https://reviews.apache.org/r/62594.
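The interleaving described above can be reproduced in a small, self-contained sketch. It does not use libprocess or the actual Mesos sources: the Scheduler class, its event queue, the integer connection id, and simulateInterruptedSend() are all hypothetical stand-ins. It shows one common way to make a continuation robust (not necessarily the approach taken in the actual fix): capture the identity of the connection in send() and re-validate it in _send() instead of asserting that a connection still exists.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for an actor-style process: handlers run one at a
// time from a queue, so other events can run between send() and _send().
class Scheduler {
public:
  // A connection to the master was established; give it a fresh id.
  void connected() { connectionId = ++nextId; }

  // A new master was detected: the current connection is torn down.
  void detected() { connectionId = 0; }

  // First half of sending a call: remember which connection is current and
  // defer the second half, as an asynchronous continuation would.
  void send(const std::string& call) {
    if (connectionId == 0) { ++dropped; return; }
    const int id = connectionId;
    queue.push_back([this, call, id] { _send(call, id); });
  }

  // Run all queued continuations in order.
  void run() {
    while (!queue.empty()) {
      std::function<void()> f = std::move(queue.front());
      queue.pop_front();
      f();
    }
  }

  int sent = 0;     // Calls that reached the (simulated) wire.
  int dropped = 0;  // Calls discarded because the connection went away.

private:
  // Second half: do not assume the connection survived since send().
  void _send(const std::string& /*call*/, int id) {
    if (connectionId == 0 || connectionId != id) { ++dropped; return; }
    ++sent;
  }

  int connectionId = 0;  // 0 means "no connection" in this sketch.
  int nextId = 0;
  std::deque<std::function<void()>> queue;
};

// Replays the order of events from the log: connect, start sending,
// detect a new master, and only then run the continuation.
Scheduler simulateInterruptedSend() {
  Scheduler s;
  s.connected();
  s.send("SUBSCRIBE");
  s.detected();  // Fires between send() and _send().
  s.run();
  return s;
}
```

In this sketch, simulateInterruptedSend() ends with the call dropped rather than with a failed check. Comparing ids, rather than only testing for presence, also covers the case where a reconnect has already completed by the time _send() runs.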
Attachments
Issue Links
- blocks: MESOS-6949 SchedulerTest.MasterFailover is flaky (Resolved)