Aurora / AURORA-1801

TaskObserver thread stops refreshing after filesystem race condition

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: Observer
    • Labels: None

    Description

      It seems that a race condition when accessing the Mesos filesystem layout can bubble up and terminate the TaskObserver thread, which is responsible for refreshing the internal data structure of available tasks. Once that thread has died, the observer no longer refreshes its task list. Restarting the observer fixes the problem.

      Exception triggering the issue:

      Traceback (most recent call last):
        File "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.bce9e54ac7cded79a75603fb4e6bcef2c7d1e6bc/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
          self.__real_run(*args, **kw)
        File "apache/thermos/observer/task_observer.py", line 135, in run
        File "apache/thermos/observer/detector.py", line 74, in refresh
        File "apache/thermos/observer/detector.py", line 58, in _refresh_detectors
        File "apache/aurora/executor/common/path_detector.py", line 34, in get_paths
        File "apache/aurora/executor/common/path_detector.py", line 34, in <genexpr>
        File "apache/aurora/executor/common/path_detector.py", line 33, in iterate
        File "/usr/lib/python2.7/posixpath.py", line 376, in realpath
          resolved = _resolve_link(component)
        File "/usr/lib/python2.7/posixpath.py", line 399, in _resolve_link
          resolved = os.readlink(path)
      OSError: [Errno 2] No such file or directory: '/var/lib/mesos/slaves/0768bcb3-205d-4409-a726-3001ad3ef902-S10/frameworks/20151001-085346-58917130-5050-37976-0000/executors/thermos-role-env-myname-0-f9fe0318-d39f-49d3-bdf8-e954d5879b33/runs/latest'
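
      The last two frames suggest a check-then-use window: the 'latest' symlink is seen to exist, then disappears before os.readlink is called on it. The following is a minimal sketch of that window, not the Aurora code path itself. It assumes a POSIX system, and the temporary directory and the names `sandbox`, `target` and `readlink_after_removal` are hypothetical stand-ins for the Mesos sandbox layout. The removal is done inline so that the outcome is deterministic.

```python
import errno
import os
import tempfile


def readlink_after_removal():
    """Return the errno raised when a symlink vanishes before it is resolved."""
    sandbox = tempfile.mkdtemp()               # hypothetical stand-in for .../runs
    target = os.path.join(sandbox, "run-1")
    latest = os.path.join(sandbox, "latest")
    os.mkdir(target)
    os.symlink(target, latest)

    assert os.path.islink(latest)              # step 1: the link is observed
    os.unlink(latest)                          # step 2: something else removes it
    try:
        os.readlink(latest)                    # step 3: resolving it now fails
    except OSError as e:
        return e.errno
    finally:
        os.rmdir(target)
        os.rmdir(sandbox)
    return None


raced_errno = readlink_after_removal()
```

      Here `raced_errno` is errno.ENOENT, the same "[Errno 2] No such file or directory" reported in the traceback.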
      

      Solution space:

      • terminate the observer process if the TaskObserver thread fails
      • prevent unknown exceptions from aborting the TaskObserver run loop
      • prevent the observed race condition in detector.py or path_detector.py
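
      The second and third options could look roughly like the sketch below. This is an illustration under assumptions, not the change that resolved this ticket: `safe_realpaths` and `run_loop` are hypothetical names, and the `realpath` parameter exists only so the behaviour can be exercised without a real race.

```python
import errno
import logging
import os
import threading


def safe_realpaths(paths, realpath=os.path.realpath):
    """Yield resolved paths, skipping entries that vanish during resolution."""
    for path in paths:
        try:
            yield realpath(path)
        except OSError as e:
            # Only swallow "no such file or directory"; anything else is unexpected.
            if e.errno != errno.ENOENT:
                raise


def run_loop(refresh, stop_event, interval=0.0, log=logging.getLogger(__name__)):
    """Call refresh() repeatedly; log unexpected errors instead of exiting."""
    while not stop_event.is_set():
        try:
            refresh()
        except Exception:
            log.exception("refresh failed; retrying on the next iteration")
        stop_event.wait(interval)
```

      Skipping ENOENT treats a vanished sandbox as "task no longer present", which matches what the next refresh would report anyway. The catch-all in the loop is a backstop for errors nobody anticipated. The first option (exiting the process when the thread dies) trades this tolerance for a visible failure that a supervisor can act on.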

          People

            Assignee: Stephan Erb (StephanErb)
            Reporter: Stephan Erb (StephanErb)