Details
Bug
Status: Open
Major
Resolution: Unresolved
0.7.0
None
None
Description
This happens with the latest code from GitHub.
I'm trying to schedule the hello_world example using a non-root role. The thermos_runner crashes when it tries to write the checkpoint in the fetch_package process.
It looks like the runner executes as the non-root user, but the checkpoint is owned by root, so the runner cannot open it.
Unfortunately, the error handling in Aurora does not surface this failure. The exception thrown by the runner is silently swallowed, and the fetch_package process appears to be running, with no failures shown in the log files. I was only able to figure out what was going on by running the runner command manually.
As a workaround, I added user 'ovidiu' to group 'root', since the directory containing the checkpoint has 'rwx' permissions for the group.
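For anyone trying to confirm the same ownership mismatch, here is a minimal diagnostic sketch. It is not part of Aurora or Thermos; the helper name `describe_access` is made up for illustration, it assumes Python 3, and it only looks at the classic owner/group/other mode bits (ACLs, capabilities, and read-only mounts are ignored).

```python
import os
import stat


def describe_access(path, uid, gids):
    """Report who owns `path` and whether (uid, gids) may write to it.

    Hypothetical helper: only the owner/group/other mode bits are checked.
    """
    st = os.stat(path)
    mode = st.st_mode
    if uid == 0:
        writable = True  # root bypasses the mode bits
    elif uid == st.st_uid:
        writable = bool(mode & stat.S_IWUSR)
    elif st.st_gid in gids:
        writable = bool(mode & stat.S_IWGRP)
    else:
        writable = bool(mode & stat.S_IWOTH)
    return {
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
        "mode": stat.filemode(mode),
        "writable": writable,
    }
```

Calling it on the checkpoint directory with the uid and group ids of the role user (for example from `pwd.getpwnam` and `os.getgrouplist`) should show `writable` as false before the group change and true after it, if the diagnosis above is right.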
This is the command:
/usr/bin/python2.7 /var/lib/mesos/slaves/20150502-132057-838930604-5050-17297-S23/frameworks/20150502-132057-838930604-5050-17297-0000/executors/thermos-1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runs/68c1af87-c531-424f-9fdb-0840cde02815/thermos_runner.pex --setuid=ovidiu --thermos_json=/var/lib/mesos/slaves/20150502-132057-838930604-5050-17297-S23/frameworks/20150502-132057-838930604-5050-17297-0000/executors/thermos-1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runs/68c1af87-c531-424f-9fdb-0840cde02815/task.json --sandbox=/var/lib/mesos/slaves/20150502-132057-838930604-5050-17297-S23/frameworks/20150502-132057-838930604-5050-17297-0000/executors/thermos-1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runs/68c1af87-c531-424f-9fdb-0840cde02815/sandbox --log_dir=. --task_id=1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1 --log_to_disk=DEBUG --checkpoint_root=/var/run/thermos --hostname=m1a.dc
And here is the output:
Writing log files to disk in .
ERROR] Found existing runner, cannot take control.
ERROR] Unknown exception: Unable to open checkpoint /var/run/thermos/checkpoints/1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runner
ERROR] Traceback (most recent call last):
ERROR] File "/var/lib/mesos/slaves/20150502-132057-838930604-5050-17297-S23/frameworks/20150502-132057-838930604-5050-17297-0000/executors/thermos-1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runs/68c1af87-c531-424f-9fdb-0840cde02815/thermos_runner.pex/apache/thermos/bin/thermos_runner.py", line 176, in proxy_main
ERROR] File "/var/lib/mesos/slaves/20150502-132057-838930604-5050-17297-S23/frameworks/20150502-132057-838930604-5050-17297-0000/executors/thermos-1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runs/68c1af87-c531-424f-9fdb-0840cde02815/thermos_runner.pex/apache/thermos/core/runner.py", line 859, in run
ERROR] with self.control(force):
ERROR] File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
ERROR] return self.gen.next()
ERROR] File "/var/lib/mesos/slaves/20150502-132057-838930604-5050-17297-S23/frameworks/20150502-132057-838930604-5050-17297-0000/executors/thermos-1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runs/68c1af87-c531-424f-9fdb-0840cde02815/thermos_runner.pex/apache/thermos/core/runner.py", line 552, in control
ERROR] raise self.PermissionError('Unable to open checkpoint %s' % ckpt_file)
ERROR] PermissionError: Unable to open checkpoint /var/run/thermos/checkpoints/1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runner
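The traceback above only showed up when the runner was started by hand. As a rough illustration of the behaviour I would expect instead, here is a small sketch of a wrapper that logs the full traceback and returns a non-zero status rather than swallowing the exception. This is not Aurora's or Thermos's actual code; `run_and_report` and the logger name are hypothetical, and it assumes Python 3.

```python
import logging
import traceback

log = logging.getLogger("runner_wrapper")  # hypothetical logger name


def run_and_report(main, *args, **kwargs):
    """Call main(); on any exception, log the traceback and return 1."""
    try:
        main(*args, **kwargs)
    except Exception:
        # Record the failure where an operator can see it instead of
        # dropping it silently.
        log.error("Runner failed:\n%s", traceback.format_exc())
        return 1
    return 0
```

A launcher could pass the result to `sys.exit`, so that whatever supervises the process sees a failed exit status and the log file contains the reason.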