Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.22.0
Fix Version/s: None
Component/s: c++ api
Labels: None
Environment: red hat linux 6.5
We've observed that the Mesos 0.22.0-rc1 C++ ZooKeeper code appears to hang (two threads stuck in indefinite pthread condition waits) on a test case that, as best we can tell, is a Mesos issue and not an issue with the underlying Apache ZooKeeper C binding.
(That is, we tried the same type of case using the Apache ZooKeeper C binding directly and saw no issues.)
This happens with a properly running ZooKeeper (standalone is sufficient).
Here's how we hung it:
We issue a Mesos ZK set via

int ZooKeeper::set(
    const std::string& path,
    const std::string& data,
    int version)

then, inside a Watcher, we process the CHANGED event by issuing a Mesos ZK get on the same path via

int ZooKeeper::get(
    const std::string& path,
    bool watch,
    std::string* result,
    Stat* stat)
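For concreteness, here is a minimal sketch of the reproduction. It is written against the Mesos C++ wrapper (src/zookeeper/zookeeper.hpp) as we understand it; the watcher class name, server address, and the assumption that a watch was previously set on the znode are ours, not taken from any Mesos test.

#include <stdint.h>

#include <string>

#include <stout/duration.hpp>

#include "zookeeper/zookeeper.hpp" // Mesos C++ wrapper, not the raw C binding.

// Watcher that re-reads the znode from inside the callback on CHANGED events.
class GetOnChangeWatcher : public Watcher
{
public:
  GetOnChangeWatcher() : zk(NULL) {}

  virtual void process(
      int type,
      int state,
      int64_t sessionId,
      const std::string& path)
  {
    if (type == ZOO_CHANGED_EVENT && zk != NULL) {
      // This get() blocks on a libprocess Future that is never satisfied;
      // it is the second backtrace shown below.
      std::string result;
      zk->get(path, false, &result, NULL);
    }
  }

  ZooKeeper* zk;
};

int main()
{
  GetOnChangeWatcher watcher;
  ZooKeeper zk("localhost:2181", Seconds(10), &watcher);
  watcher.zk = &zk;

  // Set a watch so that the following set() fires a CHANGED event.
  std::string value;
  zk.get("/craig/mo", true, &value, NULL);

  // Both this set() and the get() issued from the watcher end up stuck
  // in indefinite pthread_cond_wait, as in the backtraces below.
  zk.set("/craig/mo", "data", -1);

  return 0;
}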
We end up with two threads in the process, both stuck in pthread_cond_wait:
#0 0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1 0x00007f6664ee1cf5 in Gate::arrive (this=0x7f6140, old=0)
at ../../../3rdparty/libprocess/src/gate.hpp:82
#2 0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...)
at ../../../3rdparty/libprocess/src/process.cpp:2476
#3 0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
at ../../../3rdparty/libprocess/src/process.cpp:2958
#4 0x00007f6664e90558 in process::Latch::await (this=0x7f6ba0, duration=...)
at ../../../3rdparty/libprocess/src/latch.cpp:49
#5 0x00007f66649452cc in process::Future<int>::await (this=0x7fffa0fd9040,
duration=...)
at ../../3rdparty/libprocess/include/process/future.hpp:1156
#6 0x00007f666493a04d in process::Future<int>::get (this=0x7fffa0fd9040)
at ../../3rdparty/libprocess/include/process/future.hpp:1167
#7 0x00007f6664ab1aac in ZooKeeper::set (this=0x803ce0, path="/craig/mo", data=
...
and
#0 0x000000334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib64/libpthread.so.0
#1 0x00007f6664ee1cf5 in Gate::arrive (this=0x7f66380013f0, old=0)
at ../../../3rdparty/libprocess/src/gate.hpp:82
#2 0x00007f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...)
at ../../../3rdparty/libprocess/src/process.cpp:2476
#3 0x00007f6664ed2ce9 in process::wait (pid=..., duration=...)
at ../../../3rdparty/libprocess/src/process.cpp:2958
#4 0x00007f6664e90558 in process::Latch::await (this=0x7f6638000d00,
duration=...)
at ../../../3rdparty/libprocess/src/latch.cpp:49
#5 0x00007f66649452cc in process::Future<int>::await (this=0x7f66595fb6f0,
duration=...)
at ../../3rdparty/libprocess/include/process/future.hpp:1156
#6 0x00007f666493a04d in process::Future<int>::get (this=0x7f66595fb6f0)
at ../../3rdparty/libprocess/include/process/future.hpp:1167
#7 0x00007f6664ab18d3 in ZooKeeper::get (this=0x803ce0, path="/craig/mo",
watch=false,
....
We of course have a separate "enhancement" suggestion that the Mesos C++ ZooKeeper API use timed waits rather than blocking indefinitely for responses.
But in this case we think the Mesos code itself is blocking on itself and not handling the responses.
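As an illustration of that suggestion only (not a patch): something along these lines inside the wrapper would bound the wait. The dispatch to an internal ZooKeeperProcess actor and the "process" member name are inferred from the backtraces above, and the 30-second bound and the ZOPERATIONTIMEOUT return value are placeholders we picked.

// Hypothetical fragment of zookeeper.cpp: bound the wait instead of
// calling Future<int>::get(), which blocks forever (frame #6 above).
int ZooKeeper::set(
    const std::string& path,
    const std::string& data,
    int version)
{
  process::Future<int> result =
    process::dispatch(process, &ZooKeeperProcess::set, path, data, version);

  if (!result.await(Seconds(30))) {  // Timed wait instead of get().
    return ZOPERATIONTIMEOUT;        // Existing ZooKeeper C error code (-7).
  }

  return result.get();
}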
craig
is related to:
- MESOS-8255 ZooKeeper API is blocking, can lead to deadlock of libprocess worker threads. (Open)
- MESOS-8256 Libprocess can silently deadlock due to worker thread exhaustion. (Accepted)