Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Versions: 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.0
Sprint: Mesosphere Sprint 38
2
Description
There is a race in the SocketManager, between a remote link and disconnection of the underlying socket.
We potentially segfault here: https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1512
The expression *socket dereferences the shared pointer underpinning the Socket* object. However, ownership of that pointer actually lies with the code above this line:
https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1494-L1499
If the socket dies during the link, ignore_recv_data may delete the Socket out from underneath link:
https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1399-L1411
The same race exists for send.
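To make the lifetime problem concrete, here is a minimal sketch of the general pattern. It is not the libprocess code and not necessarily the fix that was applied. The names Conn, Manager, find_raw and find_shared are hypothetical stand-ins. It assumes the owner keeps objects in a mutex-guarded map and that a disconnect erases the map entry.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a socket. Counts live instances so that
// the object's lifetime can be observed from outside.
struct Conn {
  static int alive;
  int fd;
  explicit Conn(int fd_) : fd(fd_) { ++alive; }
  ~Conn() { --alive; }
};
int Conn::alive = 0;

// Hypothetical manager that owns connections in a mutex-guarded map.
class Manager {
 public:
  void open(int fd) {
    assert(fd >= 0);
    std::lock_guard<std::mutex> lock(mutex_);
    conns_[fd] = std::make_shared<Conn>(fd);
  }

  // Called when the peer disconnects: drops the map's reference.
  void close(int fd) {
    std::lock_guard<std::mutex> lock(mutex_);
    conns_.erase(fd);
  }

  // Racy pattern: the raw pointer is only valid while the map entry
  // exists. Using it after the lock is released is a use-after-free
  // if close() runs in between.
  Conn* find_raw(int fd) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = conns_.find(fd);
    if (it == conns_.end()) {
      return nullptr;
    }
    return it->second.get();
  }

  // Safer pattern: copy the shared_ptr while holding the lock, so the
  // caller shares ownership and a later close() cannot free the object
  // while the caller still uses it.
  std::shared_ptr<Conn> find_shared(int fd) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = conns_.find(fd);
    if (it == conns_.end()) {
      return nullptr;
    }
    return it->second;
  }

 private:
  std::mutex mutex_;
  std::map<int, std::shared_ptr<Conn>> conns_;
};
```

With find_shared, a caller that obtained the connection before close() ran can keep using it, and the object is destroyed only when the last holder lets go. With find_raw, the same sequence leaves the caller holding a dangling pointer, which is the shape of the crash described above.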
This race was discovered while running a new test in repetition:
https://reviews.apache.org/r/49175/
On OSX, I hit the race consistently every 500-800 repetitions:
3rdparty/libprocess/libprocess-tests --gtest_filter="ProcessRemoteLinkTest.RemoteLink" --gtest_break_on_failure --gtest_repeat=1000