Description
It seems that a race condition when accessing the Mesos filesystem layout can bubble up as an unhandled exception and terminate the TaskObserver thread, which is responsible for refreshing the internal data structure of available tasks. Restarting the observer fixes the problem.
Exception triggering the issue:
Traceback (most recent call last):
  File "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.bce9e54ac7cded79a75603fb4e6bcef2c7d1e6bc/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File "apache/thermos/observer/task_observer.py", line 135, in run
  File "apache/thermos/observer/detector.py", line 74, in refresh
  File "apache/thermos/observer/detector.py", line 58, in _refresh_detectors
  File "apache/aurora/executor/common/path_detector.py", line 34, in get_paths
  File "apache/aurora/executor/common/path_detector.py", line 34, in <genexpr>
  File "apache/aurora/executor/common/path_detector.py", line 33, in iterate
  File "/usr/lib/python2.7/posixpath.py", line 376, in realpath
    resolved = _resolve_link(component)
  File "/usr/lib/python2.7/posixpath.py", line 399, in _resolve_link
    resolved = os.readlink(path)
OSError: [Errno 2] No such file or directory: '/var/lib/mesos/slaves/0768bcb3-205d-4409-a726-3001ad3ef902-S10/frameworks/20151001-085346-58917130-5050-37976-0000/executors/thermos-role-env-myname-0-f9fe0318-d39f-49d3-bdf8-e954d5879b33/runs/latest'
Solution space:
- terminate the observer process if the TaskObserver thread fails
- prevent unknown exceptions from aborting the TaskObserver run loop
- prevent the observed race condition in detector.py or path_detector.py
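
A rough sketch of what the last two options could look like, for discussion only. This is not the actual Aurora code: `resolve_existing`, `run_loop`, `refresh`, `stop_event`, and `interval_secs` are hypothetical names, and the resolver is passed in as a parameter so the behavior can be exercised without a real race on disk.

```python
import errno
import logging
import os
import threading

log = logging.getLogger(__name__)


def resolve_existing(paths, resolve=os.path.realpath):
    """Yield resolve(path) for each path, skipping paths that vanish mid-resolution.

    Hypothetical helper: a sandbox symlink such as .../runs/latest may be
    removed between listing it and resolving it, which surfaces as
    OSError with errno ENOENT.
    """
    for path in paths:
        try:
            yield resolve(path)
        except OSError as e:
            if e.errno != errno.ENOENT:
                raise  # only the "disappeared underneath us" case is tolerated
            log.debug('Skipping path that disappeared during resolution: %s', path)


def run_loop(refresh, stop_event, interval_secs=1.0):
    """Call refresh() until stop_event is set, logging failures instead of dying.

    Hypothetical stand-in for a TaskObserver-style run loop: a failed
    iteration is logged and retried on the next interval.
    """
    while not stop_event.is_set():
        try:
            refresh()
        except Exception:
            log.exception('Refresh failed; retrying on next interval')
        stop_event.wait(interval_secs)
```

Catching only ENOENT in the path helper keeps other filesystem errors (permissions, I/O) visible, while the broad handler in the loop acts as a safety net for anything not anticipated. The first option (terminating the process when the thread fails) would replace the logging call in the loop with a process exit, so that an external supervisor can restart the observer.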