Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 0.22.0
- Fix Version/s: None
- Component/s: None
- Environment: Red Hat Linux 6.5
Description
We've observed that the Mesos 0.22.0-rc1 C++ ZooKeeper code appears to hang (two threads stuck in indefinite pthread condition waits) on a test case that, as best we can tell, is a Mesos issue and not an issue with the underlying Apache ZooKeeper C binding.
(That is, we tried the same test case using the Apache ZooKeeper C binding directly and saw no issues.)
This happens with a properly running ZooKeeper (standalone is sufficient).
Here's how we hung it:
We issue a Mesos zk set via

int ZooKeeper::set(
    const std::string& path,
    const std::string& data,
    int version);
then, inside a Watcher, we handle the CHANGED event by issuing a Mesos zk get on the same path via

int ZooKeeper::get(
    const std::string& path,
    bool watch,
    std::string* result,
    Stat* stat);
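A minimal sketch of the sequence we are describing is below. Only the set/get signatures above are quoted from the real code; the Watcher::process signature, the ZooKeeper constructor arguments, the include paths, and the ChangedWatcher class are assumptions made for illustration and may not match the tree exactly.

// Sketch only: reproduce "set, then get-inside-watcher on the same path".
// Signatures other than ZooKeeper::set/get are assumptions, not quotes.
#include <string>
#include <zookeeper.h>              // ZOO_CHANGED_EVENT (ZooKeeper C binding)
#include <stout/duration.hpp>       // Seconds (assumed include path)
#include "zookeeper/zookeeper.hpp"  // Mesos C++ wrapper (assumed include path)

class ChangedWatcher : public Watcher
{
public:
  explicit ChangedWatcher(ZooKeeper** zk) : zk(zk) {}

  // Assumed signature of Watcher::process in the Mesos wrapper.
  virtual void process(
      int type,
      int state,
      int64_t sessionId,
      const std::string& path)
  {
    if (type == ZOO_CHANGED_EVENT) {
      // Re-entering the wrapper from the watcher callback: this get()
      // blocks on a Future serviced by the same process, which is what
      // we believe leads to the hang.
      std::string result;
      (*zk)->get(path, false, &result, NULL);
    }
  }

private:
  ZooKeeper** zk;
};

int main()
{
  ZooKeeper* zk = NULL;
  ChangedWatcher watcher(&zk);

  // Constructor arguments (servers, session timeout, watcher) are assumed.
  zk = new ZooKeeper("127.0.0.1:2181", Seconds(10), &watcher);

  // The set triggers a CHANGED event; the watcher turns it into a get()
  // on the same path, and both threads end up in pthread_cond_wait.
  zk->set("/craig/mo", "data", -1);

  delete zk;
  return 0;
}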
We end up with two threads in the process, both stuck in pthread_cond_wait:
#0 0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1 0x00007f6664ee1cf5 in Gate::arrive (this=0x7f6140, old=0)
at ../../../3rdparty/libprocess/src/gate.hpp:82
#2 0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...)
at ../../../3rdparty/libprocess/src/process.cpp:2476
#3 0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
at ../../../3rdparty/libprocess/src/process.cpp:2958
#4 0x00007f6664e90558 in process::Latch::await (this=0x7f6ba0, duration=...)
at ../../../3rdparty/libprocess/src/latch.cpp:49
#5 0x00007f66649452cc in process::Future<int>::await (this=0x7fffa0fd9040,
duration=...)
at ../../3rdparty/libprocess/include/process/future.hpp:1156
#6 0x00007f666493a04d in process::Future<int>::get (this=0x7fffa0fd9040)
at ../../3rdparty/libprocess/include/process/future.hpp:1167
#7 0x00007f6664ab1aac in ZooKeeper::set (this=0x803ce0, path="/craig/mo", data=
...
and
#0 0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1 0x00007f6664ee1cf5 in Gate::arrive (this=0x7f66380013f0, old=0)
at ../../../3rdparty/libprocess/src/gate.hpp:82
#2 0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...)
at ../../../3rdparty/libprocess/src/process.cpp:2476
#3 0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
at ../../../3rdparty/libprocess/src/process.cpp:2958
#4 0x00007f6664e90558 in process::Latch::await (this=0x7f6638000d00,
duration=...)
at ../../../3rdparty/libprocess/src/latch.cpp:49
#5 0x00007f66649452cc in process::Future<int>::await (this=0x7f66595fb6f0,
duration=...)
at ../../3rdparty/libprocess/include/process/future.hpp:1156
#6 0x00007f666493a04d in process::Future<int>::get (this=0x7f66595fb6f0)
at ../../3rdparty/libprocess/include/process/future.hpp:1167
#7 0x00007f6664ab18d3 in ZooKeeper::get (this=0x803ce0, path="/craig/mo",
watch=false,
....
We do, of course, have a separate "enhancement" suggestion that the Mesos C++ ZooKeeper API use timed waits rather than blocking indefinitely for responses (see the sketch below).
But in this case we think the Mesos code itself is blocking on itself and not handling the responses.
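For reference, a rough sketch of what a timed wait might look like, using the Future<int>::await(duration) overload that is visible in the stack traces above. The helper name awaitOrTimeout and the choice of error codes are hypothetical, not an existing Mesos API.

// Hypothetical helper: wait on a ZooKeeper operation's Future with a
// timeout instead of blocking indefinitely in get().
#include <process/future.hpp>
#include <stout/duration.hpp>
#include <zookeeper.h>  // ZOPERATIONTIMEOUT, ZSYSTEMERROR

static int awaitOrTimeout(const process::Future<int>& future, const Duration& timeout)
{
  if (!future.await(timeout)) {
    // Give up instead of hanging the calling thread forever.
    return ZOPERATIONTIMEOUT;
  }

  // The future completed within the timeout; surface failures generically.
  return future.isReady() ? future.get() : ZSYSTEMERROR;
}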
craig
Issue Links
- is related to
  - MESOS-8255 ZooKeeper API is blocking, can lead to deadlock of libprocess worker threads. (Open)
  - MESOS-8256 Libprocess can silently deadlock due to worker thread exhaustion. (Accepted)