Killing a TASK_GROUP fails to kill some tasks

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.2.3, 1.3.2, 1.4.1, 1.5.0
    • Component/s: agent, executor
    • Labels: None
    • Sprint: Mesosphere Sprint 66, Mesosphere Sprint 67
    • Story Points: 2

    Description

      When starting the following pod definition via Marathon:

      {
        "id": "/simple-pod",
        "scaling": {
          "kind": "fixed",
          "instances": 3
        },
        "environment": {
          "PING": "PONG"
        },
        "containers": [
          {
            "name": "ct1",
            "resources": {
              "cpus": 0.1,
              "mem": 32
            },
            "image": {
              "kind": "MESOS",
              "id": "busybox"
            },
            "exec": {
              "command": {
                "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done"
              }
            },
            "volumeMounts": [
              {
                "name": "v1",
                "mountPath": "test-v1"
              }
            ]
          },
          {
            "name": "ct2",
            "resources": {
              "cpus": 0.1,
              "mem": 32
            },
            "exec": {
              "command": {
                "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done"
              }
            },
            "volumeMounts": [
              {
                "name": "v1",
                "mountPath": "etc"
              },
              {
                "name": "v2",
                "mountPath": "docker"
              }
            ]
          }
        ],
        "networks": [
          {
            "mode": "host"
          }
        ],
        "volumes": [
          {
            "name": "v1"
          },
          {
            "name": "v2",
            "host": "/var/lib/docker"
          }
        ]
      }
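
      As a rough cross-check of the definition above, the sketch below counts how many tasks the pod should produce and which containers mount each volume. It assumes one task per container per instance, which matches the six kills in the master log; `POD_JSON` is a trimmed copy of the definition and `summarize` is a hypothetical helper, not part of Marathon or Mesos.

```python
import json

# Trimmed copy of the pod definition above: only the fields this sketch reads.
POD_JSON = """
{
  "id": "/simple-pod",
  "scaling": {"kind": "fixed", "instances": 3},
  "containers": [
    {"name": "ct1",
     "volumeMounts": [{"name": "v1", "mountPath": "test-v1"}]},
    {"name": "ct2",
     "volumeMounts": [{"name": "v1", "mountPath": "etc"},
                      {"name": "v2", "mountPath": "docker"}]}
  ]
}
"""


def summarize(pod):
    """Return (expected task count, volume name -> containers mounting it)."""
    containers = pod["containers"]
    mounts = {}
    for container in containers:
        for mount in container.get("volumeMounts", []):
            mounts.setdefault(mount["name"], []).append(container["name"])
    # Assumption: one task per container per pod instance.
    return pod["scaling"]["instances"] * len(containers), mounts


task_count, mounts = summarize(json.loads(POD_JSON))
print(task_count)  # 6
print(mounts)      # {'v1': ['ct1', 'ct2'], 'v2': ['ct2']}
```

      Both containers share volume v1, so ct2 reads the clock file that ct1 writes.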
      

      Mesos successfully kills all ct2 containers but fails to kill some or all of the ct1 containers. I've attached both master and agent logs. The interesting part starts after Marathon issues 6 kills:

      Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.209966  4746 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
      Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210033  4746 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101

      Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210471  4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
      Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210518  4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101

      Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210602  4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
      Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210639  4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101

      Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210932  4753 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
      Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210968  4753 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101

      Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211210  4747 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
      Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211251  4747 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101

      Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211474  4746 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
      Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211514  4746 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
      

      All .ct1 tasks eventually fail (after ~30s), whereas the .ct2 tasks are killed successfully.
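
      To confirm that the master received a KILL call for every container of every pod instance, the log excerpt above can be tallied with a short script. This is a minimal sketch: `kills_by_instance` is a hypothetical helper, the regular expression is written against the "Processing KILL call" lines quoted in this report only, and the sample lines are copied from the excerpt with the syslog prefix and scheduler address trimmed.

```python
import re
from collections import defaultdict

# Matches the master's "Processing KILL call" lines; the task ID is quoted.
KILL_RE = re.compile(r"Processing KILL call for task '(?P<task>[^']+)'")


def kills_by_instance(lines):
    """Group the killed container names by pod instance ID."""
    grouped = defaultdict(list)
    for line in lines:
        match = KILL_RE.search(line)
        if match is None:
            continue  # e.g. the "Telling agent ... to kill task" lines
        # Task IDs look like "<pod>.instance-<uuid>.<container>".
        instance, _, container = match.group("task").rpartition(".")
        grouped[instance].append(container)
    return dict(grouped)


PREFIX = "master.cpp:5297] Processing KILL call for task "
SAMPLE_LINES = [
    PREFIX + "'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1' of framework",
    PREFIX + "'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2' of framework",
    PREFIX + "'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1' of framework",
    PREFIX + "'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2' of framework",
    PREFIX + "'simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct1' of framework",
    PREFIX + "'simple-pod.instance-328cd633-a914-11e7-bcd5-e63c853dbf20.ct2' of framework",
    "master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 to kill task",
]

for instance, containers in kills_by_instance(SAMPLE_LINES).items():
    print(instance, containers)
```

      Each of the three instances shows both ct1 and ct2, so the master did process kills for all six tasks. The missing ct1 kills therefore happen further down the path, on the agent or executor side.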

      Attachments

        1. dcos-mesos-master.log.gz
          779 kB
          A. Dukhovniy
        2. dcos-mesos-slave.log.gz
          399 kB
          A. Dukhovniy
        3. screenshot-1.png
          337 kB
          A. Dukhovniy

          People

            qianzhang Qian Zhang
            zen-dog A. Dukhovniy
            Vinod Kone Vinod Kone