MESOS-9909

Mesos agent crashes after recovery when a nested container joins a CNI network

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.7.1, 1.7.2, 1.8.0, 1.8.1
    • Fix Version/s: 1.9.0
    • Component/s: cni, containerization
    • Sprint: Containerization: RI-16 51, Containerization: RI-17 52, Containerization: RI-17 53
    • Story Points: 2

    Description

      Steps to reproduce:

      1. Use `mesos-execute` to launch a task group with checkpointing enabled. The task in the task group joins the CNI network `net1` and has a command health check that alternates between success and failure: it succeeds on the first run, fails on the second, succeeds on the third, and so on. The health check alternates so that the task keeps generating status updates after the agent recovers.

      $ mesos-execute --master=<masterIP>:5050 --task_group=file:///tmp/task_group.json --checkpoint
      $ cat /tmp/task_group.json
      {
        "tasks":[
          {
            "name" : "test",
            "task_id" : {"value" : "test"},
            "agent_id": {"value" : ""},
            "resources": [
              {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
              {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
            ],
            "command": {
              "value": "ip a && sleep 55555"
            },
            "container": {
              "type": "MESOS",
              "network_infos": [
                {
                  "name": "net1"
                }
              ]
            },
            "health_check": {
              "type": "COMMAND",
              "command": {
                "value": "if test -f file; then rm -rf file && exit 1; else touch file && exit 0; fi"
              }
            }
          }
        ]
      }
      

       2. Restart the Mesos agent. The agent then crashes when it handles the `TASK_RUNNING` status update triggered by the health check.

      I0728 16:44:34.485939 3513 slave.cpp:5702] Handling status update TASK_RUNNING (Status UUID: 81fa5c56-4d79-4da4-846a-05e94591728b) for task test in health state healthy of framework 990a6379-5727-4490-9abe-7869ff8a1cf2-0000
      F0728 16:44:34.528841 3510 cni.cpp:1462] CHECK_SOME(containerNetwork.networkInfo): is NONE
      *** Check failure stack trace: ***
      @ 0x7ffff5000e12 google::LogMessage::Fail()
      @ 0x7ffff5000d5b google::LogMessage::SendToLog()
      @ 0x7ffff50006e7 google::LogMessage::Flush()
      @ 0x7ffff5003dfe google::LogMessageFatal::~LogMessageFatal()
      @ 0x5555555f90b0 _CheckFatal::~_CheckFatal()
      @ 0x7ffff372f994 mesos::internal::slave::NetworkCniIsolatorProcess::status()
      @ 0x7ffff2e16a90 _ZZN7process8dispatchIN5mesos15ContainerStatusENS1_8internal5slave20MesosIsolatorProcessERKNS1_11ContainerIDES8_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSD_FSB_T1_EOT2_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteISO_EEOS6_PNS_11ProcessBaseEE_clESR_SS_SU_
      @ 0x7ffff2e20d57 _ZN5cpp176invokeIZN7process8dispatchIN5mesos15ContainerStatusENS3_8internal5slave20MesosIsolatorProcessERKNS3_11ContainerIDESA_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSF_FSD_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseIS4_EESt14default_deleteISQ_EEOS8_PNS1_11ProcessBaseEE_JST_S8_SW_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSY_
      @ 0x7ffff2e1ff2f _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS4_8internal5slave20MesosIsolatorProcessERKNS4_11ContainerIDESB_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSG_FSE_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS5_EESt14default_deleteISR_EEOS9_PNS2_11ProcessBaseEE_JSU_S9_St12_PlaceholderILi1EEEE13invoke_expandISY_St5tupleIJSU_S9_S10_EES13_IJOSX_EEJLm0ELm1ELm2EEEEDTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISG_Efp0_EEcl7forwardISK_Efp2_EEEEOSD_OSG_N5cpp1416integer_sequenceImJXspT2_EEEEOSK_
      @ 0x7ffff2e1f75e _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS4_8internal5slave20MesosIsolatorProcessERKNS4_11ContainerIDESB_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSG_FSE_T1_EOT2_EUlSt10unique_ptrINS2_7PromiseIS5_EESt14default_deleteISR_EEOS9_PNS2_11ProcessBaseEE_JSU_S9_St12_PlaceholderILi1EEEEclIJSX_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2EEEE_Ecl16forward_as_tuplespcl7forwardIT_Efp_EEEEDpOS16_
      @ 0x7ffff2e1f20e _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS6_8internal5slave20MesosIsolatorProcessERKNS6_11ContainerIDESD_EENS4_6FutureIT_EERKNS4_3PIDIT0_EEMSI_FSG_T1_EOT2_EUlSt10unique_ptrINS4_7PromiseIS7_EESt14default_deleteIST_EEOSB_PNS4_11ProcessBaseEE_JSW_SB_St12_PlaceholderILi1EEEEEJSZ_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOS14_
      @ 0x7ffff2e1ef11 _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos15ContainerStatusENS7_8internal5slave20MesosIsolatorProcessERKNS7_11ContainerIDESE_EENS5_6FutureIT_EERKNS5_3PIDIT0_EEMSJ_FSH_T1_EOT2_EUlSt10unique_ptrINS5_7PromiseIS8_EESt14default_deleteISU_EEOSC_PNS5_11ProcessBaseEE_JSX_SC_St12_PlaceholderILi1EEEEEJS10_EEEvOSG_DpOT0_
      @ 0x7ffff2e1ead6 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos15ContainerStatusENSA_8internal5slave20MesosIsolatorProcessERKNSA_11ContainerIDESH_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSM_FSK_T1_EOT2_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteISX_EEOSF_S3_E_JS10_SF_St12_PlaceholderILi1EEEEEEclEOS3_
      @ 0x7ffff4f0ad6b _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
      @ 0x7ffff4ecdb4a process::ProcessBase::consume()
      @ 0x7ffff4ef79d0 _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
      @ 0x5555555f9c1e process::ProcessBase::serve()
      @ 0x7ffff4eca4e8 process::ProcessManager::resume()
      @ 0x7ffff4ec695e _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
      @ 0x7ffff4ed5c7f _ZSt13__invoke_implIvZN7process14ProcessManager12init_threadsEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
      @ 0x7ffff4ed28ae _ZSt8__invokeIZN7process14ProcessManager12init_threadsEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
      @ 0x7ffff4ef0e2c _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEE9_M_invokeIJLm0EEEEDTcl8__invokespcl10_S_declvalIXT_EEEEESt12_Index_tupleIJXspT_EEE
      @ 0x7ffff4eefea0 _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEclEv
      @ 0x7ffff4eeece0 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEEE6_M_runEv
      @ 0x7fffe6eb957f (unknown)
      @ 0x7fffe69cc6db start_thread
      @ 0x7fffe66f588f clone
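
The alternating behavior of the health check in step 1 can be confirmed locally without a Mesos cluster. The following is a minimal sketch, not part of the original report: it runs the health check command copied from `task_group.json` four times in a scratch directory and records each exit status. The scratch directory stands in for the working directory in which Mesos runs the check, which is an assumption of this sketch, and the variable names (`HEALTH_CHECK`, `workdir`, `results`) are purely illustrative.

```shell
#!/bin/sh
# Health check command copied verbatim from task_group.json.
HEALTH_CHECK='if test -f file; then rm -rf file && exit 1; else touch file && exit 0; fi'

# Scratch directory standing in for the health check's working
# directory (an assumption made for this sketch).
workdir=$(mktemp -d)

results=""
for run in 1 2 3 4; do
  status=0
  # Run each check in its own shell so `exit` ends only that check;
  # `|| status=$?` captures a non-zero exit without aborting the loop.
  (cd "$workdir" && sh -c "$HEALTH_CHECK") || status=$?
  echo "run $run: exit status $status"
  results="$results$status"
done

rm -rf "$workdir"
echo "exit statuses in order: $results"
```

Starting from an empty directory, the exit statuses alternate 0, 1, 0, 1, which matches the healthy, unhealthy, healthy sequence described in step 1. Each flip produces a new status update for the agent to handle, so the crash is hit soon after the agent restarts.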
      

       

          People

            Assignee: Qian Zhang (qianzhang)
            Reporter: Qian Zhang (qianzhang)
            Shepherd: Gilbert Song