Mesos / MESOS-8762

Framework Teardown Leaves Task in Uninterruptible Sleep State D


Details

    • Type: Bug
    • Status: Accepted
    • Priority: Major
    • Resolution: Unresolved

    Description

      Marathon has a test suite that starts a simple Python HTTP server in a task group (a "pod" in Marathon) with a persistent volume. After the test run we call /teardown and wait for the Marathon framework to complete (see MesosTest).
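      For reference, the teardown step essentially amounts to a POST against the master's /teardown endpoint. A minimal sketch (the master URL and the use of the requests library are assumptions; the framework ID is taken from the logs below):

      import requests

      MASTER = "http://127.0.0.1:5050"  # assumed Mesos master address
      FRAMEWORK_ID = "3c1bf149-6b68-469c-beb9-9910f386fd5a-0000"  # from the logs below

      # Ask the master to tear down the framework; the master then shuts
      # down all of the framework's executors and tasks.
      response = requests.post(MASTER + "/master/teardown",
                               data={"frameworkId": FRAMEWORK_ID})
      response.raise_for_status()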

       

      Our CI checks whether we leak any tasks after all test runs. It turns out we do:

      Will kill:
        root     18084  0.0  0.0  45380 13612 ?        D    07:52   0:00 python src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z http://www.example.com
      Running 'sudo kill -9 18084
      Wait for processes being killed...
      ...
      Couldn't kill some leaked processes:
        root     18084  0.0  0.0  45380 13612 ?        D    07:52   0:00 python src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z http://www.example.com
      ammonite.$file.ci.utils$StageException: Stage Compile and Test failed.
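
      A process in state D (uninterruptible sleep) does not react to signals at all, not even SIGKILL, until the kernel operation it is blocked in returns, which is why the sudo kill -9 above has no effect. A minimal sketch for inspecting such a PID (assumes Linux /proc and sufficient privileges; the PID is the leaked task from the output above):

      PID = 18084  # leaked task from the CI output above

      def proc_state(pid):
          # Field 3 of /proc/<pid>/stat is the state character; the comm
          # field is wrapped in parentheses, so split after the last ')'.
          with open("/proc/%d/stat" % pid) as f:
              return f.read().rsplit(")", 1)[1].split()[0]

      def blocked_in(pid):
          # /proc/<pid>/wchan names the kernel function the process sleeps in.
          with open("/proc/%d/wchan" % pid) as f:
              return f.read().strip() or "(not blocked in the kernel)"

      print(proc_state(PID), blocked_in(PID))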
      

      The attached Mesos master and agent logs (see attachment) show

      Updating the state of task resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 of framework 3c1bf149-6b68-469c-beb9-9910f386fd5a-0000 (latest state: TASK_KILLED, status update state: TASK_KILLED)
      

      The executor logs (see zipped sandbox attached) show

      I0406 07:52:39.925599 18078 default_executor.cpp:191] Received SHUTDOWN event
      I0406 07:52:39.925624 18078 default_executor.cpp:962] Shutting down
      I0406 07:52:39.925634 18078 default_executor.cpp:1058] Killing task resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 running in child container cfce1f46-f565-4f45-aa0f-e6e6f6c5434b.0cc9c336-f0b4-448d-a8b4-dea33c02f0ae with SIGTERM signal
      I0406 07:52:39.925647 18078 default_executor.cpp:1080] Scheduling escalation to SIGKILL in 3secs from now
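
      The escalation logged here follows the usual SIGTERM-then-SIGKILL pattern. A rough Python sketch of that pattern (the real implementation lives in default_executor.cpp; the grace period matches the 3secs in the log, the polling loop is an assumption):

      import errno
      import os
      import signal
      import time

      def kill_with_escalation(pid, grace=3):
          os.kill(pid, signal.SIGTERM)        # ask the task to shut down
          deadline = time.time() + grace
          while time.time() < deadline:
              try:
                  os.kill(pid, 0)             # signal 0 only probes whether the pid is alive
              except OSError as e:
                  if e.errno == errno.ESRCH:  # process is gone, no escalation needed
                      return
                  raise
              time.sleep(0.1)
          os.kill(pid, signal.SIGKILL)        # escalate; a process stuck in state D
                                              # will not react even to SIGKILL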
      

      The task logs show

      2018-04-06 07:52:36,620 INFO    : 2.7.13
      2018-04-06 07:52:36,621 DEBUG   : ['src/app_mock.py', '35477', 'resident-pod-16322-fail', '2018-04-06T07:52:16.924Z', 'http://www.example.com']
      2018-04-06 07:52:36,621 INFO    : AppMock[resident-pod-16322-fail 2018-04-06T07:52:16.924Z]: resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 has taken the stage at port 35477. Will query http://www.example.com for health and readiness status.
      2018-04-06 07:52:38,895 DEBUG   : Got GET request
      172.16.10.198 - - [06/Apr/2018 07:52:38] "GET /pst1/foo HTTP/1.1" 200 -
      

      Compare these to the logs of a single task without a persistent volume from the same run

      I0406 07:52:39.925590 15585 exec.cpp:445] Executor asked to shutdown
      I0406 07:52:39.925698 15585 executor.cpp:171] Received SHUTDOWN event
      I0406 07:52:39.925717 15585 executor.cpp:748] Shutting down
      I0406 07:52:39.925729 15585 executor.cpp:863] Sending SIGTERM to process tree at pid 15591
      I0406 07:52:40.029878 15585 executor.cpp:876] Sent SIGTERM to the following process trees:
      [ 
      -+- 15591 sh -c echo APP PROXY $MESOS_TASK_ID RUNNING; /home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py $PORT0 /app-156 v1 http://$HOST:0/%2Fapp-156/v1 
       \--- 15612 python /home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py 35417 /app-156 v1 http://172.16.10.198:0/%2Fapp-156/v1 
      ]
      I0406 07:52:40.029912 15585 executor.cpp:880] Scheduling escalation to SIGKILL in 3secs from now
      2018-04-06 07:52:40,029 WARNING : Received 15 signal. Closing the server...
      2018-04-06 07:52:40,030 ERROR   : Socket.error in the main thread: 
      Traceback (most recent call last):
        File "/home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py", line 151, in <module>
          httpd.serve_forever()
        File "/usr/lib/python2.7/SocketServer.py", line 231, in serve_forever
          poll_interval)
        File "/usr/lib/python2.7/SocketServer.py", line 150, in _eintr_retry
          return func(*args)
        File "/usr/lib/python2.7/SocketServer.py", line 456, in fileno
          return self.socket.fileno()
        File "/usr/lib/python2.7/socket.py", line 228, in meth
          return getattr(self._sock,name)(*args)
        File "/usr/lib/python2.7/socket.py", line 174, in _dummy
          raise error(EBADF, 'Bad file descriptor')
      error: [Errno 9] Bad file descriptor
      2018-04-06 07:52:40,030 INFO    : Closing the server...
      

      Here the server receives the SIGTERM and shuts down cleanly. The task in the resident pod, however, never logs a signal and is left in uninterruptible sleep (state D), surviving even kill -9.
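      For comparison, a minimal Python 2 sketch of the shutdown pattern the trace above suggests (app_mock.py's actual implementation may differ; the port and handler are illustrative): a SIGTERM handler closes the listening socket, the blocking serve_forever() then fails with EBADF, and the process exits.

      import signal
      import SimpleHTTPServer
      import SocketServer

      httpd = SocketServer.TCPServer(("", 35417), SimpleHTTPServer.SimpleHTTPRequestHandler)

      def on_sigterm(signum, frame):
          print("Received %d signal. Closing the server..." % signum)
          httpd.socket.close()        # invalidates the fd that serve_forever() polls on

      signal.signal(signal.SIGTERM, on_sigterm)

      try:
          httpd.serve_forever()       # raises socket.error EBADF once the fd is closed
      except Exception:
          pass
      finally:
          print("Closing the server...")
          httpd.server_close()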

      Attachments

        1. master_agent.log
          18 kB
          Karsten
        2. UpgradeIntegrationTest.zip
          837 kB
          Karsten
        3. zombi-process.trace
          82 kB
          Karsten
        4. happy-process.trace
          14 kB
          Karsten
        5. zombie-process-sandbox.zip
          838 kB
          Karsten
        6. happy-process-sandbox.zip
          839 kB
          Karsten

          People

            Assignee: Unassigned
            Reporter: Karsten (jeschkies)
