Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Abandoned
1.11.0
None
None
agent:ubuntu18.04
Description
When launching a task with the image "horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1",
the agent node crashes immediately, every time, if tensorboard is started inside this image.
If tensorboard is not started by the task command, Mesos works as expected.
The agent log looks like this:
//agent crash
I0127 16:07:21.860065 30960 slave.cpp:3181] Launching task 'baseEnvSingle_gpunode1' for framework baseDevEnv_root_1611734806
F0127 16:07:21.860143 30960 slave.cpp:3194] Check failed: executor == nullptr
*** Check failure stack trace: ***
    @ 0x7f2bcc4221fc google::LogMessage::Fail()
    @ 0x7f2bcc422145 google::LogMessage::SendToLog()
    @ 0x7f2bcc421ad1 google::LogMessage::Flush()
    @ 0x7f2bcc4251e8 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f2bca4cb10b mesos::internal::slave::Slave::__run()
    @ 0x7f2bca570ac6 _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSB_INS1_13TaskGroupInfoEERKSt6vectorINS2_19ResourceVersionUUIDESaISL_EERKSB_IbEbS7_SA_SF_SJ_SP_SS_bEEvRKNS_3PIDIT_EEMSU_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_ENKUlOS5_OS8_OSD_OSH_OSN_OSQ_ObPNS_11ProcessBaseEE_clES1L_S1M_S1N_S1O_S1P_S1Q_S1R_S1T_
    @ 0x7f2bca663b01 _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS3_13FrameworkInfoERKNS3_12ExecutorInfoERK6OptionINS3_8TaskInfoEERKSD_INS3_13TaskGroupInfoEERKSt6vectorINS4_19ResourceVersionUUIDESaISN_EERKSD_IbEbS9_SC_SH_SL_SR_SU_bEEvRKNS1_3PIDIT_EEMSW_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS7_OSA_OSF_OSJ_OSP_OSS_ObPNS1_11ProcessBaseEE_JS7_SA_SF_SJ_SP_SS_bS1V_EEEDTclcl7forwardISW_Efp_Espcl7forwardIT0_Efp0_EEEOSW_DpOS1X_
    @ 0x7f2bca6555dc _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorINS5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bSt12_PlaceholderILi1EEEE13invoke_expandIS1X_St5tupleIJS8_SB_SG_SK_SQ_ST_bS1Z_EES22_IJOS1W_EEJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7EEEEDTcl6invokecl7forwardISX_Efp_Espcl6expandcl3getIXT2_EEcl7forwardIS11_Efp0_EEcl7forwardIS12_Efp2_EEEEOSX_OS11_N5cpp1416integer_sequenceImJXspT2_EEEEOS12_
    @ 0x7f2bca64da94 _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorINS5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bSt12_PlaceholderILi1EEEEclIJS1W_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7EEEE_Ecl16forward_as_tuplespcl7forwardIT_Efp_EEEEDpOS25_
    @ 0x7f2bca647e56 _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS6_13FrameworkInfoERKNS6_12ExecutorInfoERK6OptionINS6_8TaskInfoEERKSG_INS6_13TaskGroupInfoEERKSt6vectorINS7_19ResourceVersionUUIDESaISQ_EERKSG_IbEbSC_SF_SK_SO_SU_SX_bEEvRKNS4_3PIDIT_EEMSZ_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSA_OSD_OSI_OSM_OSS_OSV_ObPNS4_11ProcessBaseEE_JSA_SD_SI_SM_SS_SV_bSt12_PlaceholderILi1EEEEEJS1Y_EEEDTclcl7forwardISZ_Efp_Espcl7forwardIT0_Efp0_EEEOSZ_DpOS23_
    @ 0x7f2bca645145 _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS7_13FrameworkInfoERKNS7_12ExecutorInfoERK6OptionINS7_8TaskInfoEERKSH_INS7_13TaskGroupInfoEERKSt6vectorINS8_19ResourceVersionUUIDESaISR_EERKSH_IbEbSD_SG_SL_SP_SV_SY_bEEvRKNS5_3PIDIT_EEMS10_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSB_OSE_OSJ_OSN_OST_OSW_ObPNS5_11ProcessBaseEE_JSB_SE_SJ_SN_ST_SW_bSt12_PlaceholderILi1EEEEEJS1Z_EEEvOS10_DpOT0_
    @ 0x7f2bca641d60 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbEbSG_SJ_SO_SS_SY_S11_bEEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSE_OSH_OSM_OSQ_OSW_OSZ_ObS3_E_JSE_SH_SM_SQ_SW_SZ_bSt12_PlaceholderILi1EEEEEEclEOS3_
    @ 0x7f2bcc2f7a59 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
    @ 0x7f2bcc2baae8 process::ProcessBase::consume()
    @ 0x7f2bcc2e475c _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
    @ 0x55bb81f997ae process::ProcessBase::serve()
    @ 0x7f2bcc2b7486 process::ProcessManager::resume()
    @ 0x7f2bcc2b3878 _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
    @ 0x7f2bcc2c2c1d _ZSt13__invoke_implIvZN7process14ProcessManager12init_threadsEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
    @ 0x7f2bcc2bf84c _ZSt8__invokeIZN7process14ProcessManager12init_threadsEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
    @ 0x7f2bcc2dddca _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEE9_M_invokeIJLm0EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE
    @ 0x7f2bcc2dce3e _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEclEv
    @ 0x7f2bcc2dbc7e _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEEE6_M_runEv
    @ 0x7f2bbda376df (unknown)
    @ 0x7f2bbd54a6db start_thread
    @ 0x7f2bbd27371f clone
Aborted (core dumped)