Description
This happens because 'fork()' is not implemented as async-signal-safe in glibc, although according to POSIX it should be. When the child tries to execute the commands returned from the isolator's prepare(), it uses os::system, which calls 'fork()'.
I observed this stack trace while debugging a deadlock:
(gdb) bt
#0 0x00007f8fb2d5d2ce in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00007f8fb2ce1d8e in _L_lock_44 () from /lib64/libc.so.6
#2 0x00007f8fb2cdab4c in ptmalloc_lock_all () from /lib64/libc.so.6
#3 0x00007f8fb2d11d65 in fork () from /lib64/libc.so.6
#4 0x00007f8fb4e898de in system (command=..., directory=<value optimized out>, envp=..., uid=0, gid=0, redirectIO=<value optimized out>, pipeRead=29, pipeWrite=30, commands=std::list = {...}) at ../../../mesos/3rdparty/libprocess/3rdparty/stout/include/stout/os.hpp:558
#5 mesos::internal::slave::execute (command=..., directory=<value optimized out>, envp=..., uid=0, gid=0, redirectIO=<value optimized out>, pipeRead=29, pipeWrite=30, commands=std::list = {...}) at ../../../mesos/src/slave/containerizer/mesos_containerizer.cpp:483
#6 0x00007f8fb4e97bab in __call<, 0, 1, 2, 3, 4, 5, 6, 7, 8> (__functor=<value optimized out>) at /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/tr1_impl/functional:1137
#7 operator()<> (__functor=<value optimized out>) at /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/tr1_impl/functional:1191
#8 std::tr1::_Function_handler<int(), std::tr1::_Bind<int (*(mesos::CommandInfo, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, os::ExecEnv, unsigned int, unsigned int, bool, int, int, std::list<Option<mesos::CommandInfo>, std::allocator<Option<mesos::CommandInfo> > >))(const mesos::CommandInfo&, const std::string&, const os::ExecEnv&, uid_t, gid_t, bool, int, int, const std::list<Option<mesos::CommandInfo>, std::allocator<Option<mesos::CommandInfo> > >&)> >::_M_invoke(const std::tr1::_Any_data &) (__functor=<value optimized out>) at /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/tr1_impl/functional:1654
#9 0x00007f8fb4fcaebe in mesos::internal::slave::_childMain(const std::tr1::function<int()> &, int *) (childFunction=..., pipes=0x7f8fad4f0040) at ../../../mesos/src/slave/containerizer/linux_launcher.cpp:193
#10 0x00007f8fb2d4db6d in clone () from /lib64/libc.so.6
(gdb) info thread
* 1 Thread 0x7f8fad4f1700 (LWP 62980) 0x00007f8fb2d5d2ce in __lll_lock_wait_private () from /lib64/libc.so.6
This stack trace matches the one discussed in the glibc issue tracker:
https://sourceware.org/bugzilla/show_bug.cgi?id=4737
That issue was marked as "WONTFIX". Here is an excerpt from the discussion:
The Austin group met yesterday and retained the decision to interpret fork as async-signal-unsafe with future specifications mandating that posix_spawn be made async-signal-safe to fill the functionality gap. Minutes of the meeting are available at https://www.opengroup.org/austin/docs/austin_446.txt. I think this bug can now be closed as "WONTFIX"
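The discussion above names posix_spawn as the interface meant to fill the gap left by fork. As a minimal sketch of that direction, the following runs a shell command through posix_spawn() instead of fork(). The helper name runCommand is hypothetical and is not part of stout or Mesos. Whether posix_spawn avoids the allocator locks seen in the trace depends on the glibc version and is an assumption that would need to be verified before relying on it in the child after clone().

```cpp
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <cerrno>
#include <cstddef>

extern char** environ;

// Hypothetical helper (not part of stout or Mesos): runs `command` via
// /bin/sh using posix_spawn() and returns the child's exit status, or -1
// if the child could not be spawned or did not exit normally.
int runCommand(const char* command)
{
  char* const argv[] = {
    const_cast<char*>("sh"),
    const_cast<char*>("-c"),
    const_cast<char*>(command),
    NULL
  };

  pid_t pid;

  // posix_spawn() reports failure through its return value, not errno.
  int error = posix_spawn(&pid, "/bin/sh", NULL, NULL, argv, environ);
  if (error != 0) {
    return -1;
  }

  int status;

  // Retry waitpid() if it is interrupted by a signal.
  while (waitpid(pid, &status, 0) == -1) {
    if (errno != EINTR) {
      return -1;
    }
  }

  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, runCommand("exit 3") would return 3. The sketch does not cover the uid/gid switching, working directory, or pipe handling visible in the arguments of the trace; those would need separate treatment.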
Issue Links
- relates to MESOS-1434 MesosContainerizerIsolatorPreparationTest.ScriptFails seems flaky (Resolved)