Mesos / MESOS-8179

Scheduler library has incorrect assumptions about connections.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.5.0
    • Component/s: scheduler driver
    • Sprint: Mesosphere Sprint 67
    • Story Points: 1

    Description

      The scheduler library assumes that a connection cannot be interrupted between continuations, for example between send() and _send(): https://github.com/apache/mesos/blob/509a1ab3226bbec7c369f431656f4ec692da00ba/src/scheduler/scheduler.cpp#L553. This is not true: detected() can fire in between and disconnect the scheduler, so that _send() runs without a connection and its CHECK fails:

      I1107 18:50:57.154796 2138112 scheduler.cpp:496] New master detected at master@192.168.9.40:59063
      ...
      I1107 18:50:57.160935 2138112 scheduler.cpp:505] Waiting for 0ns before initiating a re-(connection) attempt with the master
      I1107 18:50:57.161245 1064960 clock.cpp:435] Clock of __collect__(7)@192.168.9.40:59063 updated to 2017-11-07 17:50:57.159954176+00:00
      I1107 18:50:57.161285 1898086400 clock.cpp:361] Clock resumed at 2017-11-07 17:50:57.159954176+00:00
      I1107 18:50:57.161602 1064960 scheduler.cpp:387] Connected with the master at http://192.168.9.40:59063/master/api/v1/scheduler
      I1107 18:50:57.161779 2138112 scheduler.cpp:249] Sending SUBSCRIBE call to http://192.168.9.40:59063/master/api/v1/scheduler
      I1107 18:50:57.162037 2138112 scheduler.cpp:496] New master detected at master@192.168.9.40:59063
      I1107 18:50:57.162055 2138112 scheduler.cpp:505] Waiting for 0ns before initiating a re-(connection) attempt with the master
      I1107 18:50:57.162164 4820992 process.cpp:3167] Dropping event for process __http_connection__(14)@192.168.9.40:59063
      F1107 18:50:57.162214 2138112 scheduler.cpp:553] CHECK_SOME(connections): is NONE 
      *** Check failure stack trace: ***
      E1107 18:50:57.162240 4820992 process.cpp:2576] Failed to shutdown socket with fd 9, address 192.168.9.40:59063: Socket is not connected
          @        0x10ed262b4  google::LogMessage::Flush()
          @        0x10ed2a21f  google::LogMessageFatal::~LogMessageFatal()
          @        0x10ed26ef9  google::LogMessageFatal::~LogMessageFatal()
      E1107 18:50:57.162304 4820992 process.cpp:2576] Failed to shutdown socket with fd 10, address 192.168.9.40:59063: Socket is not connected
          @        0x1078efaea  _CheckFatal::~_CheckFatal()
          @        0x1078ea675  _CheckFatal::~_CheckFatal()
          @        0x109dfcabf  mesos::v1::scheduler::MesosProcess::_send()
          @        0x109e07438  _ZZN7process8dispatchIN5mesos2v19scheduler12MesosProcessERKNS3_4CallERKNS_6FutureINS_4http7RequestEEES7_SD_EEvRKNS_3PIDIT_EEMSF_FvT0_T1_EOT2_OT3_ENKUlRS5_RSB_PNS_11ProcessBaseEE_clESR_SS_SU_
          @        0x109e072b7  _ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRNS_6__bindIZN7process8dispatchIN5mesos2v19scheduler12MesosProcessERKNS8_4CallERKNS4_6FutureINS4_4http7RequestEEESC_SI_EEvRKNS4_3PIDIT_EEMSK_FvT0_T1_EOT2_OT3_EUlRSA_RSG_PNS4_11ProcessBaseEE_JSC_SI_RNS_12placeholders4__phILi1EEEEEESZ_EEEvDpOT_
          @        0x109e06ba9  _ZNSt3__110__function6__funcINS_6__bindIZN7process8dispatchIN5mesos2v19scheduler12MesosProcessERKNS7_4CallERKNS3_6FutureINS3_4http7RequestEEESB_SH_EEvRKNS3_3PIDIT_EEMSJ_FvT0_T1_EOT2_OT3_EUlRS9_RSF_PNS3_11ProcessBaseEE_JSB_SH_RNS_12placeholders4__phILi1EEEEEENS_9allocatorIS14_EEFvSY_EEclEOSY_
          @        0x10de77d3a  std::__1::function<>::operator()()
          @        0x10e307abc  process::ProcessBase::visit()
          @        0x10e3b804e  process::DispatchEvent::visit()
          @        0x107a4b991  process::ProcessBase::serve()
          @        0x10e300191  process::ProcessManager::resume()
          @        0x10e42d27d  process::ProcessManager::init_threads()::$_2::operator()()
          @        0x10e42ce12  _ZNSt3__114__thread_proxyINS_5tupleIJZN7process14ProcessManager12init_threadsEvE3$_2EEEEEPvS6_
          @     0x7fff8591499d  _pthread_body
          @     0x7fff8591491a  _pthread_start
          @     0x7fff85912351  thread_start
      zsh: abort      GLOG_v=2 GTEST_FILTER="*SchedulerTest.MasterFailover*" ./bin/mesos-tests.sh  
      

      The bug was introduced in https://reviews.apache.org/r/62594
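
      The interleaving described above can be reproduced in miniature. The sketch below is not Mesos code: `MiniScheduler`, `Outcome`, `simulate` and the `guarded` flag are hypothetical names, and a plain `bool` stands in for the library's optional `connections`. It only assumes what the report states, namely that send() and _send() run as separate events and that detected() may be processed between them. The guarded variant shows one possible way to tolerate the disconnection (drop the call instead of asserting); it is not necessarily how the issue was fixed.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the scheduler actor: events run one at a time.
class MiniScheduler {
public:
  explicit MiniScheduler(bool guarded) : guarded_(guarded) {}

  // Queue an event, similar in spirit to dispatching to a process.
  void dispatch(std::function<void()> f) { queue_.push_back(std::move(f)); }

  void connected() { connected_ = true; }

  // A new master was detected: the current connection is torn down.
  void detected() { connected_ = false; }

  // First half of a send. The connection exists now, but the second half
  // is queued and runs later, after any events already in the queue.
  void send(const std::string& call) {
    if (!connected_) { ++dropped_; return; }
    dispatch([this, call] { _send(call); });
  }

  // Process queued events in FIFO order until none are left.
  void run() {
    while (!queue_.empty()) {
      std::function<void()> f = std::move(queue_.front());
      queue_.pop_front();
      assert(f && "queued event must be callable");
      f();
    }
  }

  int sent() const { return static_cast<int>(sent_.size()); }
  int dropped() const { return dropped_; }
  bool wouldAbort() const { return wouldAbort_; }

private:
  // Second half of a send. The connection may be gone by now.
  void _send(const std::string& call) {
    if (!connected_) {
      if (guarded_) {
        ++dropped_;          // tolerate the disconnection
      } else {
        wouldAbort_ = true;  // where an unconditional check would abort
      }
      return;
    }
    sent_.push_back(call);
  }

  bool guarded_;
  bool connected_ = false;
  bool wouldAbort_ = false;
  int dropped_ = 0;
  std::vector<std::string> sent_;
  std::deque<std::function<void()>> queue_;
};

struct Outcome { int sent; int dropped; bool wouldAbort; };

// Replays the order seen in the log: connected, SUBSCRIBE, new master detected.
Outcome simulate(bool guarded) {
  MiniScheduler s(guarded);
  s.dispatch([&s] { s.connected(); });
  s.dispatch([&s] { s.send("SUBSCRIBE"); });
  s.dispatch([&s] { s.detected(); });  // queued before _send() is dispatched
  s.run();
  return {s.sent(), s.dropped(), s.wouldAbort()};
}
```

      In this model `simulate(false)` reports that the unconditional check would have aborted, while `simulate(true)` drops the single SUBSCRIBE call and carries on.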

            People

              Assignee: Alex R (alexr)
              Reporter: Alex R (alexr)
              Shepherd: Till Toenshoff