Details
-
Bug
-
Status: Accepted
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
-
None
Description
The Marathon has a testsuite that starts a Python simple HTTP server in a task group aka pod in Marathon wit ha persistent volume. After the test run we call /teardown and wait for the Marathon framework to be completed (see MesosTest).
Our CI checks whether we leak any tasks after all test runs. It turns out we do:
Will kill: root 18084 0.0 0.0 45380 13612 ? D 07:52 0:00 python src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z http://www.example.com Running 'sudo kill -9 18084 Wait for processes being killed... ... Couldn't kill some leaked processes: root 18084 0.0 0.0 45380 13612 ? D 07:52 0:00 python src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z http://www.example.com ammonite.$file.ci.utils$StageException: Stage Compile and Test failed.
The attached Mesos master and agents logs (see attachment) show
Updating the state of task resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 of framework 3c1bf149-6b68-469c-beb9-9910f386fd5a-0000 (latest state: TASK_KILLED, status update state: TASK_KILLED)
The executor logs (see zipped sandbox attached) shows
I0406 07:52:39.925599 18078 default_executor.cpp:191] Received SHUTDOWN event I0406 07:52:39.925624 18078 default_executor.cpp:962] Shutting down I0406 07:52:39.925634 18078 default_executor.cpp:1058] Killing task resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 running in child container cfce1f46-f565-4f45-aa0f-e6e6f6c5434b.0cc9c336-f0b4-448d-a8b4-dea33c02f0ae with SIGTERM signal I0406 07:52:39.925647 18078 default_executor.cpp:1080] Scheduling escalation to SIGKILL in 3secs from now
The task logs
2018-04-06 07:52:36,620 INFO : 2.7.13 2018-04-06 07:52:36,621 DEBUG : ['src/app_mock.py', '35477', 'resident-pod-16322-fail', '2018-04-06T07:52:16.924Z', 'http://www.example.com'] 2018-04-06 07:52:36,621 INFO : AppMock[resident-pod-16322-fail 2018-04-06T07:52:16.924Z]: resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 has taken the stage at port 35477. Will query http://www.example.com for health and readiness status. 2018-04-06 07:52:38,895 DEBUG : Got GET request 172.16.10.198 - - [06/Apr/2018 07:52:38] "GET /pst1/foo HTTP/1.1" 200 -
Compare these to a single task without a persistent volume of the same run
I0406 07:52:39.925590 15585 exec.cpp:445] Executor asked to shutdown I0406 07:52:39.925698 15585 executor.cpp:171] Received SHUTDOWN event I0406 07:52:39.925717 15585 executor.cpp:748] Shutting down I0406 07:52:39.925729 15585 executor.cpp:863] Sending SIGTERM to process tree at pid 15591 I0406 07:52:40.029878 15585 executor.cpp:876] Sent SIGTERM to the following process trees: [ -+- 15591 sh -c echo APP PROXY $MESOS_TASK_ID RUNNING; /home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py $PORT0 /app-156 v1 http://$HOST:0/%2Fapp-156/v1 \--- 15612 python /home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py 35417 /app-156 v1 http://172.16.10.198:0/%2Fapp-156/v1 ] I0406 07:52:40.029912 15585 executor.cpp:880] Scheduling escalation to SIGKILL in 3secs from now 2018-04-06 07:52:40,029 WARNING : Received 15 signal. Closing the server... 2018-04-06 07:52:40,030 ERROR : Socket.error in the main thread: Traceback (most recent call last): File "/home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py", line 151, in <module> httpd.serve_forever() File "/usr/lib/python2.7/SocketServer.py", line 231, in serve_forever poll_interval) File "/usr/lib/python2.7/SocketServer.py", line 150, in _eintr_retry return func(*args) File "/usr/lib/python2.7/SocketServer.py", line 456, in fileno return self.socket.fileno() File "/usr/lib/python2.7/socket.py", line 228, in meth return getattr(self._sock,name)(*args) File "/usr/lib/python2.7/socket.py", line 174, in _dummy raise error(EBADF, 'Bad file descriptor') error: [Errno 9] Bad file descriptor 2018-04-06 07:52:40,030 INFO : Closing the server...
The server receives the SIGTERM and shuts down.