Mesos / MESOS-7991

fatal, check failed !framework->recovered()

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • None
    • None
    • None
    • Sprint: Mesosphere Sprint 66, Mesosphere Sprint 67, Mesosphere Sprint 68
    • Story Points: 3

    Description

      The Mesos master crashed during what appears to be framework recovery.

      Mesos master version: 1.3.1
      Mesos agent version: 1.3.1

      W0920 14:58:54.756364 25452 master.cpp:7568] Task 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
      W0920 14:58:54.756369 25452 master.cpp:7568] Task 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
      W0920 14:58:54.756376 25452 master.cpp:7568] Task 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
      W0920 14:58:54.756381 25452 master.cpp:7568] Task e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
      W0920 14:58:54.756386 25452 master.cpp:7568] Task f838a03c-5cd4-47eb-8606-69b004d89808 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
      W0920 14:58:54.756392 25452 master.cpp:7568] Task 685ca5da-fa24-494d-a806-06e03bbf00bd of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
      W0920 14:58:54.756397 25452 master.cpp:7568] Task 65ccf39b-5c46-4121-9fdd-21570e8068e6 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)@10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
      F0920 14:58:54.756404 25452 master.cpp:7601] Check failed: !framework->recovered()
      *** Check failure stack trace: ***
          @     0x7f7bf80087ed  google::LogMessage::Fail()
          @     0x7f7bf800a5a0  google::LogMessage::SendToLog()
          @     0x7f7bf80083d3  google::LogMessage::Flush()
          @     0x7f7bf800afc9  google::LogMessageFatal::~LogMessageFatal()
          @     0x7f7bf736fe7e  mesos::internal::master::Master::reconcileKnownSlave()
          @     0x7f7bf739e612  mesos::internal::master::Master::_reregisterSlave()
          @     0x7f7bf73a580e  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERK6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEERKSt6vectorINS5_8ResourceESaISQ_EERKSP_INS5_12ExecutorInfoESaISV_EERKSP_INS5_4TaskESaIS10_EERKSP_INS5_13FrameworkInfoESaIS15_EERKSP_INS6_17Archive_FrameworkESaIS1A_EERKSL_RKSP_INS5_20SlaveInfo_CapabilityESaIS1H_EERKNS0_6FutureIbEES9_SC_SM_SS_SX_S12_S17_S1C_SL_S1J_S1N_EEvRKNS0_3PIDIT_EEMS1R_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_T10_ET11_T12_T13_T14_T15_T16_T17_T18_T19_T20_T21_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
          @     0x7f7bf7f5e69c  process::ProcessBase::visit()
          @     0x7f7bf7f71403  process::ProcessManager::resume()
          @     0x7f7bf7f7c127  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
          @     0x7f7bf60b5c80  (unknown)
          @     0x7f7bf58c86ba  start_thread
          @     0x7f7bf55fe3dd  (unknown)
      mesos-master.service: Main process exited, code=killed, status=6/ABRT
      mesos-master.service: Unit entered failed state.
      mesos-master.service: Failed with result 'signal'.
      

      The master crashed again on Mesos 1.5 (the Docker Mesos master image from the Mesosphere Docker repo), this time on a different check:

      Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815433    13 http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133
      Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815588    13 master.cpp:5467] Processing DECLINE call for offers: [ 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework 5e57f633-a69c-4009-b7
      Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815693    13 master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323
      Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820142    10 master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi
      Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820367    10 registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the registry
      Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820572    10 registrar.cpp:552] Successfully updated the registry in 175872ns
      Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820642    11 master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin
      Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957     9 hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49
      Mar 11 10:04:35 research docker[4503]: F0311 10:04:35.851961    11 master.cpp:10018] Check failed: 'framework' Must be non NULL
      Mar 11 10:04:35 research docker[4503]: *** Check failure stack trace: ***
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044a7d  google::LogMessage::Fail()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c6046830  google::LogMessage::SendToLog()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044663  google::LogMessage::Flush()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c6047259  google::LogMessageFatal::~LogMessageFatal()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c5258e14  google::CheckNotNull<>()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c521dfc8  mesos::internal::master::Master::__removeSlave()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c521f1a2  mesos::internal::master::Master::_markUnreachable()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c5f98f11  process::ProcessBase::consume()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb2a4a  process::ProcessManager::resume()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb65d6  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c35d4c80  (unknown)
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c2de76ba  start_thread
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c2b1d41d  (unknown)
      Mar 11 10:04:36 research docker[4503]: *** Aborted at 1520762676 (unix time) try "date -d @1520762676" if you are using GNU date ***
      Mar 11 10:04:36 research docker[4503]: PC: @     0x7f96c2a4d196 (unknown)
      Mar 11 10:04:36 research docker[4503]: *** SIGSEGV (@0x0) received by PID 1 (TID 0x7f96b986d700) from PID 0; stack trace: ***
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c2df1390 (unknown)
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c2a4d196 (unknown)
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c604ce2c google::DumpStackTraceAndExit()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044a7d google::LogMessage::Fail()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c6046830 google::LogMessage::SendToLog()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044663 google::LogMessage::Flush()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c6047259 google::LogMessageFatal::~LogMessageFatal()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c5258e14 google::CheckNotNull<>()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c521dfc8 mesos::internal::master::Master::__removeSlave()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c521f1a2 mesos::internal::master::Master::_markUnreachable()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c5f98f11 process::ProcessBase::consume()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb2a4a process::ProcessManager::resume()
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb65d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c35d4c80 (unknown)
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c2de76ba start_thread
      Mar 11 10:04:36 research docker[4503]: @     0x7f96c2b1d41d (unknown)
      Mar 11 10:04:38 research systemd[1]: mesos-master2.service: main process exited, code=exited, status=139/n/a
      Mar 11 10:04:38 research docker[18886]: mesos-master
      Mar 11 10:04:38 research systemd[1]: Unit mesos-master2.service entered failed state.
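
The two crashes are reported differently by systemd: the 1.3.1 master shows "code=killed, status=6/ABRT" (the process itself died from SIGABRT), while the Dockerized 1.5 master shows "code=exited, status=139". A status above 128 usually follows the 128+N convention, where N is the signal that ended the process, which would make 139 a SIGSEGV and match the "SIGSEGV (@0x0) received by PID 1" line in the trace. The helper below is only a sketch of that arithmetic; the function name is hypothetical and the signal names assume Linux numbering.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret an exit status using the common 128+N convention.

    Hypothetical helper for illustration; assumes the status came from a
    shell or `docker run`, which report 128 + signal number when the
    process was terminated by a signal.
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # No signal with this number on the current platform.
            return f"exit status {status} (unknown signal {signum})"
        return f"terminated by {name} (signal {signum})"
    return f"exited with status {status}"


# Status reported for mesos-master2.service in the log above.
print(describe_exit_status(139))
```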
      

      In this case the failure seems to happen right after an agent drops out of the cluster (it is marked unreachable and removed), which resembles the first crash, where the check failed while an agent was re-registering.
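
To check whether other crashes follow the same pattern, the master log can be scanned for fatal glog lines together with the few log lines that preceded them. The script below is a rough sketch, not part of Mesos: `fatal_context` and `GLOG_RE` are hypothetical names, and the regex assumes the standard glog prefix (severity letter, mmdd, time, thread id, file:line) seen in the logs above, optionally preceded by a journald prefix.

```python
import re
from collections import deque

# Standard glog prefix: <severity><mmdd> <hh:mm:ss.uuuuuu> <thread> <file:line>] <message>
GLOG_RE = re.compile(
    r"\b(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[\w./-]+:\d+)\] (?P<msg>.*)"
)


def fatal_context(lines, window=5):
    """Return (fatal_entry, preceding_entries) for every F-severity line.

    Lines that do not carry a glog prefix (stack frames, systemd
    messages) are skipped.
    """
    recent = deque(maxlen=window)  # last `window` parsed entries
    hits = []
    for line in lines:
        match = GLOG_RE.search(line)
        if match is None:
            continue
        entry = match.groupdict()
        if entry["sev"] == "F":
            hits.append((entry, list(recent)))
        recent.append(entry)
    return hits


# Three lines copied from the Mesos 1.5 log above.
SAMPLE = [
    "Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957     9 hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49",
    "Mar 11 10:04:35 research docker[4503]: F0311 10:04:35.851961    11 master.cpp:10018] Check failed: 'framework' Must be non NULL",
    "Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044a7d  google::LogMessage::Fail()",
]

hits = fatal_context(SAMPLE)
for fatal, before in hits:
    print(fatal["src"], fatal["msg"])
    for entry in before:
        print("  preceded by:", entry["src"], entry["msg"])
```

Running it over a full master log and looking for "Marking agent", "Removed agent", or "during re-registration" among the preceding entries would show whether each fatal check coincides with an agent leaving or rejoining.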

            People

              Assignee: Unassigned
              Reporter: Jack Crawford (drribosome)
              Vinod Kone
              Votes: 0
              Watchers: 5
