Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-9151

Container stuck at ISOLATING due to FD leak

Attach filesAttach ScreenshotVotersWatch issueWatchersLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      When containers are launching on a single agent at scale, one container stuck at ISOLATING could occasionally happen. And this container becomes un-destroyable due to containerizer destroy always wait for isolate() finish to continue.

      We add more logging to debug this issue:

      Aug 10 17:23:28 ip-10-0-1-129.us-west-2.compute.internal mesos-agent[2974]: I0810 17:23:28.050068  2995 collect.hpp:271] $$$$: AwaitProcess waited invoked for ProcessBase ID: __await__(26651); futures size: 3; future: Ready; future index: 2; ready count: 1
      Aug 10 17:23:28 ip-10-0-1-129.us-west-2.compute.internal mesos-agent[2974]: I0810 17:23:28.414436  2998 collect.hpp:271] $$$$: AwaitProcess waited invoked for ProcessBase ID: __await__(26651); futures size: 3; future: Ready; future index: 0; ready count: 2
      

      which shows that the await() in CNI::attach() stuck at the second future (io::read() for stdout).

      By looking at the df of this stdout:

      Aug 10 17:23:27 ip-10-0-1-129.us-west-2.compute.internal mesos-agent[2974]: I0810 17:23:27.657501  2995 cni.cpp:1287] !!!!: Start to await for plugin '/opt/mesosphere/active/mesos/libexec/mesos/mesos-cni-port-mapper' to finish for container 1c8abf4c-f71a-4704-9a73-1ab0dd709c62 with pid '16644'; stdout fd: 1781; stderr fd: 1800
      

      We found

      core@ip-10-0-1-129 ~ $ ps aux | grep mesos-agent
      core      1674  0.0  0.0   6704   864 pts/0    S+   20:00   0:00 grep --colour=auto mesos-agent
      root      2974 16.4  2.5 1211096 414048 ?      Ssl  17:02  29:11 /opt/mesosphere/packages/mesos--61265af3be37861f26b657c1f9800293b86a0374/bin/mesos-agent
      core@ip-10-0-1-129 ~ $ sudo ls -al /proc/2974/fd/ | grep 1781
      l-wx------. 1 root root 64 Aug 10 19:38 1781 -> /var/lib/mesos/slave/meta/slaves/d3089315-8e34-40b4-b1f7-0ac6a624d7db-S0/frameworks/d3089315-8e34-40b4-b1f7-0ac6a624d7db-0000/executors/test2.d820326d-9cc1-11e8-9809-ee15da5c8980/runs/38e9270d-ebda-4758-ad96-40c5b84bffdc/tasks/test2.d820326d-9cc1-11e8-9809-ee15da5c8980/task.updates
      
      core@ip-10-0-1-129 ~ $ ps aux | grep 27981
      core      2201  0.0  0.0   6704   884 pts/0    S+   20:06   0:00 grep --colour=auto 27981
      root     27981  0.0  0.0   1516     4 ?        Ss   17:25   0:00 sleep 10000
      core@ip-10-0-1-129 ~ $ cat /proc/s^C       
      core@ip-10-0-1-129 ~ $ sudo -s
      ip-10-0-1-129 core # ls -al /proc/27981/fd | grep 275230
      lr-x------. 1 root root 64 Aug 10 20:05 1781 -> pipe:[275230]
      l-wx------. 1 root root 64 Aug 10 20:05 1787 -> pipe:[275230]
      
      core@ip-10-0-1-129 ~ $ sudo ls -al /proc/2974/fd/ | grep pipe
      lr-x------. 1 root root 64 Aug 10 17:02 11 -> pipe:[49380]
      l-wx------. 1 root root 64 Aug 10 17:02 14 -> pipe:[49380]
      lr-x------. 1 root root 64 Aug 10 17:02 17 -> pipe:[48909]
      lr-x------. 1 root root 64 Aug 10 19:38 1708 -> pipe:[275089]
      l-wx------. 1 root root 64 Aug 10 19:38 1755 -> pipe:[275089]
      lr-x------. 1 root root 64 Aug 10 19:38 1787 -> pipe:[275230]
      l-wx------. 1 root root 64 Aug 10 17:02 19 -> pipe:[48909]
      

      pipe 275230 is held by the agent process and the sleep process at the same time!

      The reason why the leak is possible is because we don't use `pipe2` to create a pipe with `O_CLOEXEC` in subprocess:
      https://github.com/apache/mesos/blob/1.5.x/3rdparty/libprocess/src/subprocess_posix.cpp#L61

      Although we do set cloexec on those fds later:
      https://github.com/apache/mesos/blob/1.5.x/3rdparty/libprocess/src/subprocess.cpp#L366-L373

      There is a race where a fork happens after `pipe()` call, but before cloexec is called later. This is more likely on a busy system (this explains why it's not hard to repo the issue when launching a lot of containers on a single box).

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            gilbert Gilbert Song
            gilbert Gilbert Song
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Agile

                Completed Sprints:
                Mesosphere Sprint 2018-26 ended 16/Aug/18
                Mesosphere Sprint 2018-27 ended 30/Aug/18
                Mesosphere Sprint 2018-28 ended 15/Sep/18
                View on Board

                Slack

                  Issue deployment