Mesos / MESOS-9025

The container which joins a CNI network and has checkpointing enabled will be mistakenly destroyed by the agent


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.6.1, 1.7.0
    • Component/s: containerization
    • Sprint: Mesosphere Sprint 2018-23
    • Story Points: 3

    Description

      Steps to reproduce:

      1) Run mesos-execute to launch a task that joins a CNI network net1 and has checkpointing enabled:

      $ cat task_cni.json
      {
        "name": "test1",
        "task_id": {"value" : "test1"},
        "agent_id": {"value" : ""},
        "resources": [
          {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
          {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
        ],
        "command": {
          "value": "sleep 1000"
        },
        "container": {
          "type": "MESOS",
          "network_infos": [
            {
              "name": "net1"
            }
          ]
        }
      }
      
      $ mesos-execute --master=192.168.56.5:5050 --task=file:///home/stack/workspace/config/task_cni.json --checkpoint
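
      For reference, net1 above is the name of a CNI network configuration that the agent loads from its --network_cni_config_dir (with plugin binaries found via --network_cni_plugins_dir). The actual configuration used is not part of this report; a minimal sketch of what net1 might look like, assuming the standard bridge and host-local plugins, an arbitrary subnet, and an illustrative config directory path, is:

      $ cat /etc/mesos/cni/net1.conf
      {
        "cniVersion": "0.3.1",
        "name": "net1",
        "type": "bridge",
        "bridge": "mesos-cni0",
        "isGateway": true,
        "ipMasq": true,
        "ipam": {
          "type": "host-local",
          "subnet": "192.168.100.0/24"
        }
      }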
      

      2) After the task is in the TASK_RUNNING state, restart the agent process; in the agent log we will then see the container being destroyed.

      ...
      I0622 17:30:00.792310  7426 containerizer.cpp:1024] Recovering isolators
      I0622 17:30:00.798740  7430 cni.cpp:437] Removing unknown orphaned container faf69105-e76f-49c7-8e56-964c2f882cff
      ...
      I0622 17:30:01.025600  7433 cni.cpp:1546] Unmounted the network namespace handle '/run/mesos/isolators/network/cni/faf69105-e76f-49c7-8e56-964c2f882cff/ns' for container faf69105-e76f-49c7-8e56-964c2f882cff
      I0622 17:30:01.026211  7433 cni.cpp:1557] Removed the container directory '/run/mesos/isolators/network/cni/faf69105-e76f-49c7-8e56-964c2f882cff'
      I0622 17:30:02.935093  7429 slave.cpp:5215] Cleaning up un-reregistered executors
      I0622 17:30:02.935221  7429 slave.cpp:5233] Killing un-reregistered executor 'test1' of framework dc2b3db0-953c-47a4-8fd4-f6d040e9d10e-0002 at executor(1)@192.168.11.7:33719
      I0622 17:30:02.935900  7429 slave.cpp:7311] Finished recovery
      I0622 17:30:02.937409  7427 containerizer.cpp:2405] Destroying container faf69105-e76f-49c7-8e56-964c2f882cff in RUNNING state
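
      The "Removed the container directory" line above means that, during recovery, the network/cni isolator has already torn down its checkpointed network state for a container that is still running. Assuming a default runtime directory layout (the path below is taken from the log above and may differ on other setups), this can be observed by listing the isolator's runtime directory before and after the restart:

      # Before restarting the agent, the running container's network namespace
      # handle is checkpointed under the CNI isolator's runtime directory:
      $ ls /run/mesos/isolators/network/cni/
      faf69105-e76f-49c7-8e56-964c2f882cff

      # After the restart, recovery removes the entry as an "unknown orphaned
      # container" even though the container is still running:
      $ ls /run/mesos/isolators/network/cni/
      $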
      

      And mesos-execute will receive a TASK_GONE status update for the task:

      $ mesos-execute --master=192.168.56.5:5050 --task=file:///home/stack/workspace/config/task_cni.json --checkpoint
      I0622 17:29:50.538630  7246 scheduler.cpp:189] Version: 1.7.0
      I0622 17:29:50.548589  7261 scheduler.cpp:355] Using default 'basic' HTTP authenticatee
      I0622 17:29:50.550348  7263 scheduler.cpp:538] New master detected at master@192.168.56.5:5050
      Subscribed with ID dc2b3db0-953c-47a4-8fd4-f6d040e9d10e-0002
      Submitted task 'test1' to agent 'dc2b3db0-953c-47a4-8fd4-f6d040e9d10e-S0'
      Received status update TASK_STARTING for task 'test1'
        source: SOURCE_EXECUTOR
      Received status update TASK_RUNNING for task 'test1'
        source: SOURCE_EXECUTOR
      Received status update TASK_GONE for task 'test1'
        message: 'Executor did not reregister within 2secs'
        source: SOURCE_AGENT
        reason: REASON_EXECUTOR_REREGISTRATION_TIMEOUT
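
      The "2secs" in the status message is the agent's executor reregistration timeout, configurable via the --executor_reregistration_timeout agent flag. Raising it only changes how long the agent waits for executors to reregister and does not prevent the orphan cleanup performed by the CNI isolator during recovery shown above, but it can be a useful knob while diagnosing; a sketch, with all other agent flags elided:

      $ mesos-agent --master=192.168.56.5:5050 \
          --work_dir=/var/lib/mesos \
          --executor_reregistration_timeout=1mins \
          ...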
      

          People

            Shepherd: Jie Yu
            Assignee: Qian Zhang
            Reporter: Qian Zhang
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved:

              Agile

                Completed Sprint:
                Mesosphere Sprint 2018-23 ended 05/Jul/18
