|
I think
Before beginning discussions on approach, I wanted to summarize my understanding of this task, and also start discussion on a few points that I have some questions on.
The following are some salient points:
Is there anything that I am missing ? Comments on the questions of shared directories / files - distributed cache, intermediate outputs, log files ? I think that (2) depends on how (1) is proposed to be addressed. If you assume that (1) is addressed by using seteuid() or the su command such that processes actually run on the system as the appropriate user, then (2) is extremely difficult without being ruin as root.
If (1) is addressed just by setting the UGI in some way, then this had disadvantages compared to the seteuid/su - which facilitates secured access to non-HDFS resources (e.g. NFS in smaller environments). I had some offline discussions with Arun and Sameer and here are some initial thoughts on approach. A lot of details still need to be flushed out, but I am posting this to get some early feedback.
We do want to run the daemons as non-privileged users, and yet go with a setuid based approach to run tasks as a regular user. One approach that was proposed to do this is as follows:
The points above specify a broad approach. Please comment on whether this seems reasonable, reasonable in parts, or completely way off the mark. smile. Based on feedback, I would start implementing a prototype to flush out the details. +1 for a setuid program.
It should be written in C, not Java to ensure it has enough access to the platform to actually be secure. In particular, it has to clear both real and effective user ids. I'd like to see the proposed list of commands for the setuid program. No user-specified strings should be included on the command line, to avoid special character attacks. I agree with Sameer that we should have very tight permissions on the map output and task directories. One of the subcommands should probably be to move the outputs from somewhere like $task/output to somewhere like $tt/output/$job/$task. Having a plugin that lets us switch between the current pure-java implementation that doesn't change user ids and a setuid implementation sounds reasonable. We should continue to support the non-user-switch by default for clusters run by a single non-root user. Thanks for the comments, Owen.
Yes, I had that in mind. Specifically, I was planning to do something like setuid(getpwnam(user_name)->pw_uid). Since this would be done by a program running as superuser (the setuid exe), it would clear both the real and effective uids.
Sure, I will work on that and post the list here. In order to be reasonably complete, I think I should have a version that's working. So, I will start prototyping on the lines I described above. What about Cygwin / Windows users?
Reseting fix for version, as this will not make the feature freeze.
You are about to take on one of the big problems they hit in the grid world: identity. all the grid tools (condor, platform, etc) have lots of effort put in at the OS level to create new users on target machines, manage the disk and cpu usage limits of that user, etc. But you also need to propagate identity over the wire, which gets you into SAML and other things. Because right now the JobTracker trusts you to be who you say you are -having caller authentication would be a prerequisite to doing back-end user switching.
If you are interested in running pure Java apps under different rights, this could be done via a security manager. Every task would be started with an explicit security manager/policy that limited what it could do, file and network operations would be checked against the policy. This would be portable and easier to test. It also eliminates the need to run the TT as root, to keep the unix user database in sync with the hadoop user list, etc. Steve, I do agree that the security manager approach is simpler and portable. However, I think the requirement is also to support features like streaming. Given that, I think the security manager approach would not work. Am I right ?
Also, I agree that authentication and authorization is a pre-requisite for this. It is being handled by the other tasks under the jira HADOOP-4487. Here, I am focussing only on the mechanisms to make tasks run as the users who submitted the jobs - a small part of the larger framework. I have been able to make some progress and get a wordcount job to run as the job submitter. The design follows the basic approach mentioned above, minus the plugin abstraction, which I need to create yet.
A couple of things came up when doing this:
The code I have is very raw and needs lots of polishing even as a first draft. Will try to do so in a couple of days. Any comments on the approach so far ? Hemanth -you are right, for streaming/pipes stuff a second identity is needed. What some of the grid toolkits have done in the past is have some low-privilege user for running work; there isn't a 1:1 mapping of grid users to user accounts, instead the worker is allowed access to the relevant files of a user for a while, then at the end of the job, that data goes away. This eliminates some of the account management problems, though forces you to make sure that the worker doesnt have access to any old/shared data on the same filesystem.
From what I've heard, running under a security manager kills performance, which is pretty much a non-starter. Especially given that we need unix-level security anyways. In our environment, running as the real user is important. If I run a job, I should not be able to look at or kill your job's data or tasks, even if we are sharing a machine. Of course this feature needs to be optional, since:
1. It requires that all cluster users have accounts on all slave nodes in the cluster 2. It requires native code that may not work on all platforms 3. It requires root access, which not all Hadoop admins have. Hemanth,
This sounds good, but from a security standpoint, I think that it would be better to make the tasks more specific. So something like: CREATE_TASK_DIR owen task_20080101_0001_m_000001_1 It would also be good to have the task tracker root directories in a separate config file that can be owned by root. My goal is to make this executable as limited as possible. It should also block root as the user. What I do not want to see is having this work: MOVE_FILES root /tmp/foo /etc/passwd > Steve: have some low-privilege user for running work; there isn't a 1:1 mapping of grid users to user accounts
> Owen: running as the real user is important. If I run a job, I should not be able to look at or kill your job's data or tasks Might it be possible to have a pool of low-privileged users, to remove the requirement that every user has an account on every machine? Or maybe that requirement's not that onerous, with PAM/LDAP? The user who submits the job should be the user who runs the code on the compute nodes due to issues that surround the environment outside Haddop. For example, it is possible to submit a job that writes junk data to the low priv user's home dir. Without tracking who submitted that job, ops would never know who to go bonk on the head.
... and then there is streaming. I can think of instances where it might be useful to have generic accounts run stuff. In those instances, it is still much better to have that handled outside Hadoop. [Either through setuid scripts, roles, sudo, kinit a special keytab prior to job submit, whatever.] Let the OS/tool/ops team/whatever deal with the accounting in those situations.
Owen, sure this makes sense. Just one point is that I might need the job id along with the task id. But that's still within the same spirit. I will also make the other changes you have suggested like blocking root user. I've also added a CLEANUP_TASK_DIR command to the executable which is now able to cleanup the directories after task is completed. This is called from the CleanupQueue thread in task tracker. I had an offline discussion with Devaraj regarding the implementation, and we also went over the impact this would have when clubbed with JVM reuse.
A few comments from him that I am documenting here:
With JVM reuse though, the last point becomes problematic. We can easily setup the directories and move the output before and after the task. However, that needs to be done with a separate launch of the executable - three times actually. The performance impact this would have (and would it offset the advantage of JVM reuse) is something to measure and see. I have a version now that runs with JVM reuse enabled also. The main changes to get this to work was to correct the way I was figuring out the current task ids in the JvmManager class.
Also added a KILL_TASK command. This will look as follows: KILL_TASK user_name job_id task_attempt_id This will be called from JvmRunner.kill(), which in turn is called whenever a TT gets a kill task action. Since the JVM process is running as the job owner (different from the TT), we can't directly destroy the JVM process. Instead, what I've done is the following:
Any concerns with this approach ? Currently other than distributed cache, all other aspects of the task life cycle are functioning. I'll probably upload a single writeup (as Arun had done for And of course, follow that up with the first patch. smile That sounds reasonable. Please ensure that the pid file is owned by the user given on the command line and has permissions of 600. This avoids someone leaving this file writable and having someone point it at a different process.
Thanks, Owen. I will take care of that.
Some discussion is required for handling distributed cache (HADOOP-4493). Firstly, localized files from distributed cache are not localized per job. Since anything can be passed through distributed cache, I think it should support the same level of access control as the rest of the files. That is, they should be changed to be localized per job and subject to the same access control mechanisms we are using for the rest of the files - like output directories etc. I don't think this is a big impact for users as they can't assume the cache to contain the files they want on the nodes where the task is running. However, from the system perspective, probably if a lot of users (say working on the same project that requires the same data files) want to share this but across multiple jobs, we would be copying only once per node, saving both space and time. If we modify this to be localized per job, we could lose that advantage, no ? Any thoughts on this trade off ? I had an offline discussion with Sameer about how to get this patch in. To make it easier for reviewing, maybe it makes sense to split the task up into multiple sub tasks. Atleast 3 that are identified are:
In order to get a working launch and kill tasks patch though, the file and directory permissions will need to be opened up to allow access to all users. Each of the other patches will make it more secure. Please note that we have discussed the approach of how we will address directory and file permissions (such as intermediate outputs) in this JIRA already. This proposal is only to make it simpler to get some incremental patches in. Would this work ? If yes, I will use this JIRA to handle the first of the three tasks, then use HADOOP-4491 and HADOOP-4493 for the others.
We are taking care of this point in the setuid executable. One question is to determine how the location of this secure config file will known to the executable. Following are our options: Option 1: Read from the environment variable HADOOP_CONF_DIR Options 1 and 2 may allow users to launch the executable pointing to some custom path. Option 3 would completely avoid this, and make it more secure. For the sake of deployment, I think the setuid executable should be built using a separate ant target, as it would need to be setup as owned by root etc. So, maybe it is easy to do Option 3 in that case. If we decide to go with one of the other two options, we should mandate additional checks to make sure that the configuration file is owned by the root user, as Owen mentioned. Any comments ? The attached patch implements changes in the tasktracker to launch tasks using the setuid executable defined in
task-controller <user-name> <command-enum-value> <job-id> <task-id> <tasktracker-root>
As mentioned in comments above, this patch only handles launching and killing of tasks, and does not handle file and directory permissions securely. In fact, it opens up the permissions so that both the tasktracker and task can share files and directories. However, this change is only done when the feature is enabled, and does not affect the default Hadoop behavior. When HADOOP-4491 and other issues are fixed, secure permissions will be replaced. The changes in the patch include:
Tested this on a single node cluster, along with the setuid executable of I request a review for the same. A lot of discussion has happened on various comments in this JIRA. The attached document collates all of them. I hope this will make it easier to follow the approach and review the changes.
I attached a new patch that is more comprehensive. All changes from the previous patch still hold good. This one adds the correct permissions for all relevant files and directories, except distributed cache.
The previous patch only set relevant permissions on the task and log cache directories for all users, with the intent that tasks running as any user should be able to create and use other files and directories under them. This requirement still applies. However, there are other files and directories whose access needs to be adjusted too. The new patch addresses these changes:
Both the changes above are required in future as well. Except then, the permission string would be more restrictive (disallowing access to group and others). The previous patch was working because of a subtle behavior in setuid. On the systems where we tested, the umask was set such that read and execute permissions were provided to group by default. So, any of the job files created by the tasktracker had read and execute to the group to which the tasktracker user belonged. When the setuid executable switched users, it does not clear the supplementary group information of the launcher. Hence, the new process running as the job owner still had access to the groups to which the tasktracker belonged, and hence worked. Again, in HADOOP-4491, we propose to remove all access for the group ownership also, and hence this will not be an issue. The latest patch also adds correct permissions to files localized as part of the distributed cache. In order to do this, I introduced a new API in distributed cache to indicate whether the files were localized freshly (i.e. as part of the current task's localization), or whether they are already existing in cache. I use this API to avoid setting permissions on the same cache files repeatedly for each task. If there's a better way to do this, I would be glad to know that.
All changes made in the previous patch still hold good, except for minor refactoring. This patch is now complete to the best of my knowledge. Please do offer your comments. Hemanth, this is look good.
Some comments:
I've updated the patch to trunk, incorporating most of Arun's comments above. Arun, can you please take a look.
Done.
Done.
Done. I've added a new overloaded API that passes the information to DistributedCache. Just to keep options open, I've defined a new public class DistributedCacheFileAccessInfo - a simple class that can be used to define permissions and ownership information for localized files in DistributedCache. Can you take a specific look at this, and let me know if this looks OK ?
I've not done this one alone. It was not very clear what information is necessary at launch time. For e.g. if there are some localized files under the task cache directory that need to be loaded at launch time, we'll need permissions for these also. In general, it seemed a little risky to launch the JVM without giving full access to all jars etc, even if the Task will start running later only. So, I've left this as is. I think the main concern here was about the special check I had in JvmManager where I was avoiding setting the permissions again when getting the task to launch. This seems a simple enough check, and I've documented the rationale in code. Can you verify this again, and let me know your thoughts ? Attached a hopefully last version of the patch. This one is extensively tested and has fixed a couple of bugs related to incorrect assumptions about multiple mapred local directories. Thanks to Sreekanth and Amar for help in testing this. We're run randomwriter, sort with and without JVM reuse, and also streaming and using distributed cache.
The test-patch results are showing a -1 on release audit which I've written to core-dev about. I am not sure why a -1 is coming, will continue to debug that. There's also a -1 on tests. It is difficult to write unit tests for this patch since it requires support for multiple users. There are a lot of log statements in the patch which should be removed before commit. I am attaching this with the hope that someone can take a look at the changes. Arun ? Merged with trunk. It was broken by a commit done yesterday night. Also removed extraneous logs. Running ant test.
test-patch gives following output: [exec] -1 overall. The -1 on release audit is because jdiff has generated some changes to public classes and packages (DistributedCache, FileUtil and the filecache package). The release audit seems to be flagging these new jdiff changed files as warnings. I've cross checked the ASF license header is included for all new files I've put up. The -1 on tests is as explained above. There's no change in functionality to the last patch I uploaded. Only a merge. I also have to mention that some additional work would be required in the LinuxTaskController to sync up with changes from
However, the default path (covered by the DefaultTaskController) is not affected, and hence there will be no regression. I am proposing that apart from this, if other things are fine, we should commit ant test passes locally.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12399627/HADOOP-4490.patch against trunk revision 741330. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. -1 release audit. The applied patch generated 821 release audit warnings (more than the trunk's current 819 warnings). +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3806/testReport/ This message is automatically generated. The attached file fixes the streaming test failures. I'd made changes to FileUtil.java to run the chmod command using the ShellCommandExecutor. Previously this was using the Process class. There was no reason to change it, so I moved back to using Process and the tests passed locally.
I thought ant test runs contrib tests as well, which is why I missed these on the first patch. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12399725/HADOOP-4490.patch against trunk revision 741776. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3812/testReport/ This message is automatically generated. The failed contrib test is in Chukwa and is the same as
Some comments after a discussion with Hemanth:
Thanks for the review Arun.
I had a discussion with Sreekanth about the changes, and we are proposing the following:
The above changes will mean we can remove:
Arun, does this tie in with your expectations ?
On second thoughts - I'd go there right-away, since it means that we don't need to open it up to 777 for mapred.local.dir if the DefaultTaskController is in effect.
+1
+1
+1
+1 we were running with this patch and had problems with streaming scripts being set to 666. This patch fixes that to make all the files unjarred to be readable and executable by all.
Mahadev, thanks for debugging this issue. The problem was that the permissions being used in LinuxTaskController were making an assumption that files submitted to the cluster, say via streaming jobs, or in distributed cache, will already be executable if they need to be executed as part of the tasks. However, there seem to be scenarios in which this is not mandated, and the framework handles this in the default case (when user tasks are run as the tasktracker itself). The right fix is to modify the permissions without this assumption. If we do that, I guess the change in RunJar is not necessary. One more concern is that the RunJar change would fix the problem only for jar files, but not for other types of archives. We are testing this (with the failed streaming test case) with the modified permissions in LinuxTaskController. If that works, I suggest we just use the new version of the LinuxTaskController itself.
Again, thanks for debugging this issue ! Attaching patch incorporating Comments from Hemanth and Arun. The only change which has been done which move away from the above are:
Fixing, findbugs warning. Plus had made mistake in localizejob in LinuxTaskController corrected it.
One more change required. I've noticed that when the JVM Manager calls a killTaskJVM, the current working directory for the LinuxTaskController is set to a task attempt directory. This will fail, because this directory will no longer exist. While this is not creating problems as well behaved JVMs will exit themselves, it is not something to be relied on.
The relevant piece of code is this: private ShellCommandExecutor buildTaskControllerExecutor(TaskCommands command, String userName, List<String> cmdArgs, JvmEnv env) throws IOException { ... if (env != null) { shExec = new ShellCommandExecutor(taskControllerCmd, env.workDir, env.env); }else { shExec = new ShellCommandExecutor(taskControllerCmd); } } Setting env.workDir will set the working directory to that particular task attempt directory which may no longer exist. I get the following error when this happens: 2009-03-13 10:39:52,750 WARN org.apache.hadoop.mapred.LinuxTaskController: IOException in killing task: Cannot run program "/path/to/bin/task-controller" (in directory "/path/to/mapred-local/taskTracker/jobcache/job_200903130908_0051/attempt_200903130908_0051_m_001012_1000/work"): java.io.IOException: error=2, No such file or directory Attaching patch incorporating Hemanth's Comment. Tested the patch and exception was not thrown.
Found a problem while this patch was being tested. When a TT was being re-inited by JT after it's lost for some time, TT got the following NPE and it crashed completely. 2009-03-22 22:22:36,155 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.lang.NullPointerException
at org.apache.hadoop.mapred.LinuxTaskController.buildTaskCommandArgs(LinuxTaskController.java:183)
at org.apache.hadoop.mapred.LinuxTaskController.killTaskJVM(LinuxTaskController.java:237)
at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.kill(JvmManager.java:401)
at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.stop(JvmManager.java:211)
at org.apache.hadoop.mapred.JvmManager.stop(JvmManager.java:61)
at org.apache.hadoop.mapred.TaskTracker.close(TaskTracker.java:925)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1815)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2899)
Attaching a patch which resolves the NullPointerException raised when TT is re-inted. The problem was with race condition in which JVMRunner was issued kill before it launched the task. The patch now checks if the inital context is set. i.e task is launched if not, it merely updates the data structures.
Attaching patch modifiying a minor thing, to check jvm env member of TaskControllerContext. The JVM env needs to be set for both controllers. So the check has been moved into JVMRunner.
Attaching latest patch with following changes:
Latest Merged patch.
Mixed up merging in the last patch. Attaching new patch which corrects the issue.
Some comments:
Comments on documentation:
Attaching patch addressing Review comments by Hemanth.
Uploading built pdf documentation for the change made by the patch for review.
Code changes look fine to me. Looking at the documentation changes.
Attaching new patch with documentation changes.
Attaching output of the ant test patch:
[exec]
[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no tests are needed for this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12404914/hadoop-4490-9.patch against trunk revision 763247. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 Eclipse classpath. The patch retains Eclipse classpath integrity. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/166/testReport/ This message is automatically generated. The contrib test failures are not related to the patch.
Attaching new patch, the patch modifies the kill task part from the previous patch. This passes on the configuration value mapred.tasktracker.tasks.sleeptime-before-sigkill to task-controller binary. Current trunk version of the binary would ignore this value. Once fix for
Running thro' Hudson.
After offline discussion with Hemanth and Vinod the changes to LinuxTaskController for resolving
Reverting to previous version of patch. Attaching patch fixing an issue found while testing on large clusters while task trackers are re-inited.
Attaching patch with following changes:
I think we should move the call to initializeTask into JvmRunner.runChild(). This way it is more explicit why the check is required. Other than this +1. Please make this PA. Making the initalizeTask call explicit in runChild.
Attaching output of the ant test-patch
[exec]
[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no tests are needed for this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] -1 release audit. The applied patch generated 652 release audit warnings (more than the trunk's current 651 warnings).
[exec]
[exec]
All core and contrib tests passed on the local machine except TestQueueCapacities which is being handled in seperate JIRA. The release audit warning is due to a new public API in FileUtil. This is expected.
This patch has been tested extensively manually, and a version of this patch is also deployed in production environment for a while now. There is a plan to develop unit tests as a followup. Given these, I am committing this patch, as it is blocking some other jiras in M/R. I just committed this. Thanks, Sreekanth !
Integrated in Hadoop-trunk #811 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/811/
Attached example for earlier version not to be committed.
Another patch needed if applying this to the 0.20 branch.
Attaching an example patch for branch 20 not to be committed.
Attaching new Yahoo! Distribution patch.
Editorial pass over all release notes prior to publication of 0.21. Subtask.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
HADOOP-4451.