Mesos / MESOS-9025

A container that joins a CNI network and has checkpointing enabled is mistakenly destroyed by the agent

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.6.1, 1.7.0
    • Component/s: containerization
    • Sprint: Mesosphere Sprint 2018-23
    • Story Points: 3

    Description

      Steps to reproduce:

      1) Run mesos-execute to launch a task that joins the CNI network net1 and has checkpointing enabled:

      $ cat task_cni.json
      {
        "name": "test1",
        "task_id": {"value" : "test1"},
        "agent_id": {"value" : ""},
        "resources": [
          {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
          {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
        ],
        "command": {
          "value": "sleep 1000"
        },
        "container": {
          "type": "MESOS",
          "network_infos": [
            {
              "name": "net1"
            }
          ]
        }
      }
      
      $ mesos-execute --master=192.168.56.5:5050 --task=file:///home/stack/workspace/config/task_cni.json --checkpoint
      

      2) Once the task is in the TASK_RUNNING state, restart the agent process. The agent log then shows the container being destroyed:

      ...
      I0622 17:30:00.792310  7426 containerizer.cpp:1024] Recovering isolators
      I0622 17:30:00.798740  7430 cni.cpp:437] Removing unknown orphaned container faf69105-e76f-49c7-8e56-964c2f882cff
      ...
      I0622 17:30:01.025600  7433 cni.cpp:1546] Unmounted the network namespace handle '/run/mesos/isolators/network/cni/faf69105-e76f-49c7-8e56-964c2f882cff/ns' for container faf69105-e76f-49c7-8e56-964c2f882cff
      I0622 17:30:01.026211  7433 cni.cpp:1557] Removed the container directory '/run/mesos/isolators/network/cni/faf69105-e76f-49c7-8e56-964c2f882cff'
      I0622 17:30:02.935093  7429 slave.cpp:5215] Cleaning up un-reregistered executors
      I0622 17:30:02.935221  7429 slave.cpp:5233] Killing un-reregistered executor 'test1' of framework dc2b3db0-953c-47a4-8fd4-f6d040e9d10e-0002 at executor(1)@192.168.11.7:33719
      I0622 17:30:02.935900  7429 slave.cpp:7311] Finished recovery
      I0622 17:30:02.937409  7427 containerizer.cpp:2405] Destroying container faf69105-e76f-49c7-8e56-964c2f882cff in RUNNING state
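
The sequence in the log above (the CNI isolator removes the container as an "unknown orphaned container", and the containerizer later destroys the same container while it is RUNNING) can be checked mechanically. The following is a minimal sketch, not part of Mesos: the function name `find_wrongly_destroyed` is hypothetical, and the regular expressions assume the message wording stays exactly as quoted in this report. It only correlates the two log messages; it does not prove which component caused the destruction.

```python
import re

# Patterns copied from the agent log lines quoted in this report.
# Source line numbers (e.g. cni.cpp:437) vary between versions, so they are matched loosely.
ORPHAN_RE = re.compile(
    r"cni\.cpp:\d+\] Removing unknown orphaned container (\S+)")
DESTROY_RE = re.compile(
    r"containerizer\.cpp:\d+\] Destroying container (\S+) in RUNNING state")


def find_wrongly_destroyed(log_lines):
    """Return IDs of containers that were first removed as orphans by the
    CNI isolator and later destroyed while in the RUNNING state."""
    orphaned = set()
    destroyed = []
    for line in log_lines:
        match = ORPHAN_RE.search(line)
        if match:
            orphaned.add(match.group(1))
            continue
        match = DESTROY_RE.search(line)
        if match and match.group(1) in orphaned:
            destroyed.append(match.group(1))
    return destroyed


# Two lines taken verbatim from the log excerpt in this report.
SAMPLE_LOG = [
    "I0622 17:30:00.798740  7430 cni.cpp:437] Removing unknown orphaned "
    "container faf69105-e76f-49c7-8e56-964c2f882cff",
    "I0622 17:30:02.937409  7427 containerizer.cpp:2405] Destroying "
    "container faf69105-e76f-49c7-8e56-964c2f882cff in RUNNING state",
]

if __name__ == "__main__":
    for container_id in find_wrongly_destroyed(SAMPLE_LOG):
        print("suspect container:", container_id)
```

To run it against a real agent log, pass the opened log file (an iterable of lines) instead of `SAMPLE_LOG`.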
      

      mesos-execute then receives a TASK_GONE status update for the task:

      $ mesos-execute --master=192.168.56.5:5050 --task=file:///home/stack/workspace/config/task_cni.json --checkpoint
      I0622 17:29:50.538630  7246 scheduler.cpp:189] Version: 1.7.0
      I0622 17:29:50.548589  7261 scheduler.cpp:355] Using default 'basic' HTTP authenticatee
      I0622 17:29:50.550348  7263 scheduler.cpp:538] New master detected at master@192.168.56.5:5050
      Subscribed with ID dc2b3db0-953c-47a4-8fd4-f6d040e9d10e-0002
      Submitted task 'test1' to agent 'dc2b3db0-953c-47a4-8fd4-f6d040e9d10e-S0'
      Received status update TASK_STARTING for task 'test1'
        source: SOURCE_EXECUTOR
      Received status update TASK_RUNNING for task 'test1'
        source: SOURCE_EXECUTOR
      Received status update TASK_GONE for task 'test1'
        message: 'Executor did not reregister within 2secs'
        source: SOURCE_AGENT
        reason: REASON_EXECUTOR_REREGISTRATION_TIMEOUT
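
When repeating the reproduction, it can help to pull the status updates out of the mesos-execute output and check the final state and reason automatically. This is a rough sketch that assumes the output format shown above (a "Received status update ..." line followed by indented "key: value" lines); `parse_status_updates` is a hypothetical helper, not a Mesos API.

```python
import re

# Matches lines such as: Received status update TASK_GONE for task 'test1'
UPDATE_RE = re.compile(r"^Received status update (\w+) for task '([^']+)'$")
# Matches detail lines such as: reason: REASON_EXECUTOR_REREGISTRATION_TIMEOUT
FIELD_RE = re.compile(r"^(\w+): (.*)$")


def parse_status_updates(lines):
    """Collect status updates as dicts with 'state', 'task' and any
    detail fields (source, message, reason) printed below them."""
    updates = []
    for raw in lines:
        line = raw.strip()
        match = UPDATE_RE.match(line)
        if match:
            updates.append({"state": match.group(1), "task": match.group(2)})
            continue
        match = FIELD_RE.match(line)
        if match and updates:
            # Detail lines belong to the most recent status update.
            updates[-1][match.group(1)] = match.group(2).strip("'")
    return updates


# Output lines taken from the mesos-execute run quoted in this report.
SAMPLE_OUTPUT = """\
Submitted task 'test1' to agent 'dc2b3db0-953c-47a4-8fd4-f6d040e9d10e-S0'
Received status update TASK_STARTING for task 'test1'
  source: SOURCE_EXECUTOR
Received status update TASK_RUNNING for task 'test1'
  source: SOURCE_EXECUTOR
Received status update TASK_GONE for task 'test1'
  message: 'Executor did not reregister within 2secs'
  source: SOURCE_AGENT
  reason: REASON_EXECUTOR_REREGISTRATION_TIMEOUT
""".splitlines()

if __name__ == "__main__":
    last = parse_status_updates(SAMPLE_OUTPUT)[-1]
    print(last["state"], last.get("reason"))
```

A final update of TASK_GONE with reason REASON_EXECUTOR_REREGISTRATION_TIMEOUT after an agent restart matches the behavior described in this report.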
      

            People

              jieyu Jie Yu
              qianzhang Qian Zhang
              Qian Zhang Qian Zhang