|
[
Permlink
| « Hide
]
Amar Kamat added a comment - 03/Nov/08 05:59 AM
It will be interesting to see how we will deal with intermediate data (map output). Even if we store intermediate data under user's permission, we will still have to open it for shuffle/reduce.
> It will be interesting to see how we will deal with intermediate data (map output).
I have opened a new JIRA for it. See Here's an overview of the problem(with most of the input from
Objective:
Problems to solve:
Issues involved:
I am summarizing the state-of-the-art local data management on TT
Localization of task on the TT:
Intermediate output files
Intermediate output serving
Task logs' serving
In summary, the current directory structure follows. Unless otherwise stated, directories/files are owned by TT but used by both TT and child. taskTracker
|- archive
|- jobcache
|- jobid
|- work
|- job.xml
|- jars/job.jar
|- taskid[.cleanup]
|- work
|- job.xml / task-specific
|- taskjvm.sh ( created and used by TT)
|- output ( owned by child, used by TT)
|- all intermediate output files ( owned by child, used by TT)
mapred.child.tmp ( owned and used by child)
hadoop.log.dir|userlogs (owned and used by child)
|- taskid (owned and used by child)
|- stdout ( owned by child, used by TT)
|- stderr ( owned by child, used by TT)
|- syslog ( owned by child, used by TT)
Some (broad) proposals for solving this issue:
Localization (A) Move the whole localization out of the taskTracker o be done as the user.
(B) Separate tt-only, child-only space from shared space. TT-only and child-only spaces are exclusively for the TT and the child respectively. TT does localization in tt-only area, task-controller binary then moves directory structure to the child only area. The shared space is for the stuff generated by the child for TT and has restricted access (511 on dirs and 444 on files) for TT and others. Even though other users can read this area, they won't be able to delete/write stuff.
(C) Instead of separating the directory structures completely, use the same for both TT and the user wherever necessary.
Intermediate output
Task logs
Depending on these, I think that even though (B) sacrifices some of the strict 700 restrictions to a more free 511/444, it keeps things simple. But I am open to other proposals too. Thoughts? Few points:
Here's what the patches do: common-patch:
mapreduce-patch:
I've tested the patch by running the LinuxTaskController specific tests and observing the directory structure over time. All the tests with successful, failed, killed jobs, with and without reuse pass in the sense that throughout a job's timeline, any path is either owned by the TT or the child and has the most secure permissions possible. On the patch that affects common, some comments:
1) RunJar.unjar should set the permissions for all the entries in the path after expansion 2) DiskChecker.java has lot of code to do with permissions handling to make it generic but not everything would be actually used. In fact, the approach taken in making some APIs generic is debatable. We might as well keep it simple for now and extend those APIs as and when required. 3) LocalDirAllocator.getSecureLocalPathForWrite could be renamed as getPrivateLocalPathForEWrite On the mapreduce part of the patch,
1) TaskTracker.getIntermediateOutputDirForChild can be renamed to getBaseIntermediateOutputDir 2) TaskTracker.initializeJobDirs should delete stuff on all the disks, but should return successfully if it is able to create the path on at least one disk 3) job-id/work should be user owned? 4) Check whether the call to mkdirs(workDir) in localizeJob is required? 5) There is a spurious LOG.info(" From fComf " + fConf.get("mapred.local.dir")); in localizeTask. Remove it. 6) Files created within private directories don't have to have 700 permissions. Remove all such permissions settings (e.g., this applies to localTaskFile in localizeTask) 7) The work-dir in the attempt directory should be readable/writable by all tasks that the JVM runs (in the jvm reuse case). Currently, jvmManager.finalizeTaskDirs will prevent this from happening. 8) It'd be good to remove redundant calls to finalizeTaskDirs in JvmManager. 9) It will be nice to combine the APIs for creating files/directories and setting appropriate permissions all in one API 10) Check if the configuration setting of mapred.local.dir for the child can be done in the TaskTracker before launching the task (instead of the task itself doing that) 11) There is a spurious LOG.info in Child. Remove that or reword that. I haven't looked at the TaskController/LinuxTaskController changes.. Documenting more things that need to be done. These came out in a short discussion with Hemanth/Sreekanth.
Adding patches for quick review. Still a work in progress. Incorporated most of Devaraj's comments.
Two major changes from the previous patch:
Long time, the correct solution is for this directory to be owned by the child, but cleaned up by TT using a LinuxTaskController binary lauch to swith user. Pending items from Devaraj's comments:
Extra:
There are some minor TODO's also left to do. Uploading patches that include the rest of the comments, except
There are very few occurrences of above now in the code. So dropping this. The patches can be reviewed, while I try any possible code reuse/refactoring and making test-cases comprehensive. Adding new patches:
Freezing up any further changes till review comments are put up. I would like to call out some approaches this patch is taking explicitly, so that there's an opportunity for discussion:
1. Setting secure permissions by default vs doing so only in the LinuxTaskControllerThe patch sets permissions for localized job files and directories as part of the localization process itself, that is as soon as the files are localized, rather than in the TaskController. I believe this approach was taken primarily to secure permissions on the localized files on the disk ASAP. The effect of doing so is that now all configurations of hadoop, including those which do not worry about security will have secure permissions for some of the files. While this should be OK in principle, it must be noted that without the entire solution offered by the LinuxTaskController, full security is not achieved. Hence, does it make sense to make changes to the permissions only when using the LinuxTaskController (with a small window where files could have potentially insecure permissions), and leave the default case alone ? Either way, I think the current method in which permissions are changed for the localized files causes these to be splattered across the code. It may lead to mistakes in future if somebody is making changes (say to localize a new file) and is not fully aware of the permissions issue. Ideally it should be abstracted from the process. In speaking with Vinod about this, I realized this may not be very easy to achieve the way LocalDirAllocator currently works. If we keep the changes to the task controller alone, I believe this will be very easy to accomplish. 2. Approach for sharing common files between TT and childThe patch juggles around with permissions on the localized and intermediate files by first setting ownership to the job owner, and after the task is complete, changing the ownership back to the tasktracker user. The reason it does this, is for two purposes: Note that change of ownership is done by launching the task-controller setuid binary as root privileges are required for the purpose. The above two problems can be solved in another way as well (and this is what was discussed on On the other hand, the current model seems to be simpler to fit into mind. If a task is not complete, its files are owned by the job owner, if not, they are owned by TT. It is easy to check these rules on a running system. So, is it worth the change in approach to bring in a little optimization (which might actually not matter that much). 3. Sandboxing task runtime environment by changing values of mapred.local.dir for the childThe patch 'sandboxes' the task runtime environment, by changing the mapred.local.dir values from the original configured values to ${original mapred.local.dir}/taskTracker/jobCache/${job-id}/${task-id} and pass the new values to the child. Vinod told me that this was primarily required because checks in the LocalDirAllocator require the current process to have write permissions on the context passed to it (which is basically the mapred local directories). When the child calls LocalDirAllocator, it would now fail because the original mapred local directories will not have write permissions to world. Hence, the need to sandbox. Of course, it is also probably more secure. Two questions this raises: Is this change in contract acceptable ? If yes, is it acceptable in the default case as well, or should it be restricted to the case of LinuxTaskController alone. Either way, the current patch, in order to implement this behavior juggles around the values of the mapred.local.dir variable in conf file at 2-3 locations in the tasktracker code. I think that approach is error prone and needs to change. Thoughts on these three points ? Some other comments on the code:
Some nits:
Uploading mapred part of the patch that incorporates most of the review comments.
– Done
Things to be done:
I looked at the C code that was pending from previous review. Looking good overall. I have a few comments - mostly minor:
I had some offline discussions with Hemanth/Owen/Arun. Those discussions resulted in a change of proposal for this JIRA.
The TaskTracker's directory structure will be like the following: $ttroot
|
|- distcache
|
|- users
| |
| '-- $jobid
| |
| |- jars
| | '-- job.jar
| |- work
| |
| `-- $attemptid
| |- job.xml
| |- taskjvm.sh
| |- work
| | '-- tmp
| `-- output
|
'-- system
|
`-- $jobid
|
|- job.xml
|
|- userfiles
| |- jars
| | '-- job.jar
| '-- work
|
|- $attemptid
| |- job.xml
| |- taskjvm.sh
| '-- work
| '-- tmp
|
'-- outputs
`-- $attemptid
Broadly, there are two directory strucutres - system and users
These changs will be needed for both DefaultTaskController as well as the LinuxTaskController. LinuxTaskController uses the setuid binary to do the move operations as the root and changing ownership of the target files to the user. Distributed cache files and the log files still need to be baked into this structure. After a little playing around with Linux, it looks like it blocks using hard links for directories, even for root. frown
That leaves us two options for dealing with the logs directory: I'm leaning toward the second one. $ttroot - mapred, 755
|
|- jobs - mapred 755
| |
| '-- $jobid - mapred 700 + rw access by $user
| |
| |- distcache - mapred 755
| |- jars - mapred 755
| | '-- job.jar
| `-- $attemptid - mapred 755
| |- job.xml
| |- taskjvm.sh
| |- work - $user 700
| |- logs - $user 755
| `-- output - $user 755
'-- system
|
`-- $jobid
|
|- job.xml
So all of the protection is at the $jobid level. The user only has write access to the work, logs, and output directories. Thoughts?
Are you referring to the ACLs that use extended attributes on the file system. Their support should be explicitly enabled in the kernel, and also supported by the file system type, right ? If yes, doesn't this approach make the utility of this feature much more restrictive ? It seems like option 1 is a slightly better option in that case... Owen and I had an offline discussion about this, and we felt another approach to try out was to see if we could have the directories and files owned by the user and group-owned by the tasktracker. The group ownership should be sticky so permissions are inherited. The permissions must apply for all the relevant components in the paths.
So, $jobid and $attemptid in the examples above would be owned by the user, group-owned by mapred, and have permissions like 570 or similar. This might also remove the need to have parallel directory structures. The rationale for this approach follows from the fact that for maximum security the task-controller executable needs to be group owned by the tasktracker (to prevent other users from launching it). Hence, this almost means that the tasktracker user is a special user in the system that is required for secure installations. And it can be setup such that the user is in a separate group on its own. Thoughts ? Going forward with the latest proposal.
Salient points:
child umask 0007 Uploading a new patch. The last patch for the common project holds.
The patch is functionally complete, but not the cleanest. Some of the minor review comments on the linux task-controller binary code are still in progress. This is looking much simpler and good. Some comments, mostly minor:
I have still to look at the task-controller changes, will take the latest patch for the same. One more point. I noticed that ReduceTask.ReduceCopier.configureClasspath() is using Task.getJobFile(). Will this be a problem given we are setting the value of the job file just before it is written to disk ?
Some comments on the task-controller code:
I missed out one more point. In initialize_job we should be setting the permissions as 2570 recursively. Only the work directory needs to open up write for the user.
Getting close. A few minor points:
I also looked at the Forrest documentation and the javadocs:
One more point which I discussed with Vinod. The changes in common are at this point just a simple convenience wrapper API around the java.io.File APIs for setting permissions. I propose we move this into the mapred patch itself and open a JIRA to discuss the changes to common more carefully.
Attaching patch including most of the above comments.
hadoop.log.dir is still needed to set permissions on userlogs/attempt-id directories to be writable by the user's tasks and readable by TT for serving via web UI.
Moved these changes into TaskTracker.PermissionsHandler. In future (via MAPREDUCE-759), this along with most of the remaining localization related code should be refactored into a separate class. Can we also weed out trace type log statements both in the the C and Java code, move them to a debug mode if needed ?
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12416027/HADOOP-4491-20090810.1.txt against trunk revision 802224. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 47 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/596/console This message is automatically generated. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12416038/HADOOP-4491-20090810.3.txt against trunk revision 802954. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 47 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 5 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/463/testReport/ This message is automatically generated. +1 for the changes, except a minor nit: there's a typo for the 'LOGFILE' as 'OGFILE' under the #ifdef DEBUG flag in the task-controller code.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12416158/HADOOP-4491-20090811.txt against trunk revision 803049. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 50 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/466/testReport/ This message is automatically generated. Attaching a patch that
I just ran the test cases that touch the changes in the last patch. I also generated docs and verified the documentation changes.
The contrib test failures reported by Hudson are unrelated. Plz see This patch is committable. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12416292/HADOOP-4491-20090812.txt against trunk revision 803231. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 50 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/471/testReport/ This message is automatically generated. Apart from the documented contrib tests, the core test failure is MAPREDUCE-441. Going to commit this patch now.
I just committed this. Thanks, Vinod !
Integrated in Hadoop-Mapreduce-trunk #47 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/47/
. Setup secure permissions for localized job files, intermediate outputs and log files on tasktrackers. Contributed by Vinod Kumar Vavilapalli. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||