Mesos / MESOS-10146

Removing task from slave when framework is disconnected causes master to crash


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.9.0
    • Fix Version/s: None
    • Component/s: c++ api, framework
    • Labels: None
    • Environment: Mesos master with three master nodes

    Description

      Hello,

          We want to report an issue we observed when the master removes tasks from an agent (slave). Before the tasks can be removed, the master checks that the framework owning them is still valid. A framework can be disconnected for several reasons, and when it is, this check fails and crashes the Mesos master node.

      https://github.com/apache/mesos/blob/1.9.0/src/master/master.cpp#L11842

      There is also unguarded access to the internal framework state on line 11853.
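
      To make the failure mode concrete, here is a minimal, self-contained sketch of a removal loop that tolerates a missing framework instead of aborting. It is not the Mesos source and not the fix that was committed: Framework, Agent, Frameworks and removeAgentTasks are hypothetical, simplified stand-ins for the master's bookkeeping, written only to illustrate the idea of skipping an unknown framework rather than failing a fatal check.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the master's bookkeeping.
struct Framework {
  std::set<std::string> taskIds;  // tasks tracked for this framework
};

struct Agent {
  std::string id;
  // framework ID -> IDs of that framework's tasks on this agent
  std::map<std::string, std::vector<std::string>> tasks;
};

using Frameworks = std::map<std::string, Framework>;

// Removes every task of `agent` from the owning framework's bookkeeping.
// Frameworks referenced by the agent but unknown to `frameworks` are
// logged, skipped, and returned to the caller instead of aborting.
std::vector<std::string> removeAgentTasks(Frameworks& frameworks, Agent& agent) {
  std::vector<std::string> missing;

  for (const auto& entry : agent.tasks) {
    const std::string& frameworkId = entry.first;

    auto it = frameworks.find(frameworkId);
    if (it == frameworks.end()) {
      // The crash in this report corresponds to a fatal check at the
      // equivalent point; here the entry is skipped instead.
      std::cerr << "Framework " << frameworkId
                << " not found while removing agent " << agent.id
                << "; skipping" << std::endl;
      missing.push_back(frameworkId);
      continue;
    }

    for (const std::string& taskId : entry.second) {
      it->second.taskIds.erase(taskId);
    }
  }

  agent.tasks.clear();
  return missing;
}
```

      Note that in the log below the agent still holds an entry for the framework with an empty task set ({ ...-0067: { } }), so a loop of this shape has nothing to remove for that framework and skipping the entry loses no task state. Whether skipping is the right policy for the real master is an assumption of this sketch.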

      Error logs:

      mesos-master[5483]: I0618 14:05:20.859189 5491 master.cpp:9512] Marked agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 (10.160.73.79) unreachable: health check timed out
      mesos-master[5483]: F0618 14:05:20.859347 5491 master.cpp:11842] Check failed: framework != nullptr Framework 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067 not found while removing agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 at slave(1)@10.160.73.79:5051 (10.160.73.79); agent tasks: { 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067: { } }
      mesos-master[5483]: *** Check failure stack trace: ***
      mesos-master[5483]: I0618 14:05:20.859781 5490 hierarchical.cpp:1013] Removed all filters for agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
      mesos-master[5483]: I0618 14:05:20.872217 5490 hierarchical.cpp:890] Removed agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
      mesos-master[5483]: I0618 14:05:20.859922 5487 replica.cpp:695] Replica received learned notice for position 42070 from log-network(1)@10.160.73.212:5050
      mesos-master[5483]: @ 0x7f2fdf6a5b1d google::LogMessage::Fail()
      mesos-master[5483]: @ 0x7f2fdf6a7dfd google::LogMessage::SendToLog()
      mesos-master[5483]: @ 0x7f2fdf6a56ab google::LogMessage::Flush()
      mesos-master[5483]: @ 0x7f2fdf6a8859 google::LogMessageFatal::~LogMessageFatal()
      mesos-master[5483]: @ 0x7f2fde2677f2 mesos::internal::master::Master::__removeSlave()
      mesos-master[5483]: @ 0x7f2fde267ebe mesos::internal::master::Master::_markUnreachable()
      mesos-master[5483]: @ 0x7f2fde268215 _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKSsEUlbE_JbEEEEclEv
      mesos-master[5483]: @ 0x7f2fddf30688 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
      mesos-master[5483]: @ 0x7f2fdf5e3b91 process::ProcessBase::consume()
      mesos-master[5483]: @ 0x7f2fdf608f77 process::ProcessManager::resume()
      mesos-master[5483]: @ 0x7f2fdf60cb36 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
      mesos-master[5483]: @ 0x7f2fdf8c34d0 execute_native_thread_routine
      mesos-master[5483]: @ 0x7f2fdba02ea5 start_thread
      mesos-master[5483]: @ 0x7f2fdb20e8dd __clone
      systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT
      systemd[1]: Unit mesos-master.service entered failed state.
      systemd[1]: mesos-master.service failed.
      systemd[1]: mesos-master.service holdoff time over, scheduling restart.
      systemd[1]: Stopped Mesos Master.
      systemd[1]: Started Mesos Master.
      mesos-master[28757]: I0618 14:05:41.461403 28748 logging.cpp:201] INFO level logging started!
      mesos-master[28757]: I0618 14:05:41.461712 28748 main.cpp:243] Build: 2020-05-09 10:42:00 by centos
      mesos-master[28757]: I0618 14:05:41.461721 28748 main.cpp:244] Version: 1.9.0
      mesos-master[28757]: I0618 14:05:41.461726 28748 main.cpp:247] Git tag: 1.9.0
      mesos-master[28757]: I0618 14:05:41.461730 28748 main.cpp:251] Git SHA: 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e

       


            People

              Unassigned Unassigned
              sunshine123 Naveen