|
[
Permlink
| « Hide
]
Doug Cutting added a comment - 10/Oct/07 10:22 PM
Maybe it should attempt to create '/user/<username>/' before it starts using '/'? I worry about '/' getting polluted on shared filesystems each time a new user comes online.
Creating '/user/<username>/' or throwing an exception would both be reasonable alternatives. This behavior can probably be FileSystem implementation dependent.
Automatically creating a working directory may become awkward with permissions and when delegating tasks to an agent. Further, creating a default place for a user's data should be part of adding that user to the system, not executing a task.
Requesting a "home" directory for a given set of credentials- rather than a default "working directory"- from a FileSystem seems more correct; the working directory seems like FileSystem state owned by an application (i.e. the FileSystem object). If one wants to resolve relative paths, the working directory must be set first on the particular instance. This way, relative Paths can only be resolved against a FileSystem where the working directory is set, absolute Paths are always OK, FileSystems can return a default directory for a given user (but not in general), and all Paths from a FileSystem are fully qualified (HADOOP-1909). At the moment, the working directory is set by the TaskTracker (to the property provided) and by IsolationRunner (for local, temporary storage). It is used sparingly, but notably by applications like FsShell and distcp (where reasonable defaults can be set and checked). Are there other places where this is relied on that might make effecting this change more difficult? > Automatically creating a working directory may become awkward [ ... ]
Yes, I agree. So we might instead:
Under that regime, folks who first try to run a mapreduce job with relative paths against an HDFS system they have not used will get a "working dir does not exist" exception, which sounds about right. Thoughts? I took a shot at implementing your idea. I left S3 alone; S3FileSystemBaseTest::testWorkingDirectory() looks like it has particular semantics I ought not to mess with.
It throws an IllegalArgumentException from setWorkingDirectory if the requested dir doesn't exist and IllegalStateException from getWorkingDirectory if it is unset (i.e. not found during initialize). It's assumed that the default working directory will almost always exist. It looks OK, but it sounded better. Was this what you had in mind? This looks mostly good to me. I don't understand some of the changes in the test code. Are these required for this issue?
> It looks OK, but it sounded better. Not sure what you mean. Do you have reservations about this approach? One thing I'm not wild about is the amount of duplicate code. I wonder whether much of this could be moved to the base FileSystem class. The only thing that's really FileSystem-specific is the default working dir. This might look something like: private Path workingDir = null;
public Path getWorkingDir() {
if (workingDir == null) {
throw new IllegalStateException("working dir unset");
}
return workingDir;
}
public void setWorkingDir(Path p) throws IOException {
if (!exists(p)) {
throw new IOException("working dir does not exist: " + p);
}
workingDir = p;
}
protected Path getDefaultWorkingDir() { return new Path("/"); }
Then most FileSystem implementations would just override getDefaultWorkingDir. FileSystem.java would call setWorkingDir(getDefaultWorkingDir()) after initialize(). Could that work? The changes to the test code were to accommodate the Trash- which attempts to use the working directory if the default path isn't absolute- in tests (TestReplicationPolicy, MiniDFSCluster), correcting a transaction count that was thrown off by creating a working directory in MiniDFSCluster (TestEditLog), and accommodating the new semantics of setWorkingDirectory (TestLocalDFS). If there's a less kludgy way to pass the first and second tests, I'd be open to it.
Not with the approach, but this impl wasn't fitting well with cases like that in HADOOP-1916, where FsShell might create a default working dir if it didn't exist. This is cleanly solved by adding the FileSystem::getDefaultWorkingDir() method you suggested. I'll rework the patch. Do you prefer throwing an IOException to an IllegalArgumentException in setWorkingDirectory? I went back and forth on that, but ultimately wanted to distinguish between problems with FileSystem::exists() and a non-existent path (e.g. RawInMemoryFileSystem). > Do you prefer throwing an IOException to an IllegalArgumentException in setWorkingDirectory?
I don't have a strong preference, but do prefer solutions with less code. I think using IOException results in less code, and would give it the nod for that reason. -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12369642/2025-1.patch against trunk revision r595563. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs -1. The patch appears to introduce 1 new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1105/testReport/ This message is automatically generated. Accommodate findbugs (use explicit null instead of known null)
+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12369704/2025-1.patch against trunk revision r595563. @author +1. The patch does not contain any @author tags. javadoc +1. The javadoc tool did not generate any warning messages. javac +1. The applied patch does not generate any new compiler warnings. findbugs +1. The patch does not introduce any new Findbugs warnings. core tests +1. The patch passed core unit tests. contrib tests +1. The patch passed contrib unit tests. Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1112/testReport/ This message is automatically generated. I'd assumed we'd put a call to setWorkingDir(getDefaultWorkingDir()) in FileSystem#get(), just after the call to initialize(). We might first check that it exists, and leave it unset otherwise, so that folks who never rely on the working dir won't fail unless they attempt to access it.
Then getDefaultWorkingDir() could be protected: no one should ever need to call it. Why do you have calls to getDefaultWorkingDir() instead of getWorkingDir()? Also, the default working dir for Hftp should be the same as for HDFS, not "/". I wonder if we should make that (/home/$user) the default for all filesystems, and then only override it in RawLocalFilesystem and FilterFileSystem. That would further simplify the implementation of most FileSystems.
Re: Hftp: yes, that should default to /user/$user, with DistributedFileSystem; I'll correct that. KosmosFileSystem also differs from this default... I'm tempted to just leave it abstract, but if you'd prefer a default in FileSystem to /user/$user that's fine.
I left it public so the Trash would default to /user/$user instead of the cwd; it also leaves open the possibility of applications (like FsShell) creating the default working directory if it doesn't exist or permitting operations like "dfs -mkdirs ." . Further, the Trash is initialized before the FileSystem, so a relative path needs to resolve to something before initialize() is called. Things are similar in JobTracker initialization, where we encounter an infinite loop (exercised by TestSocketFactory) if the system dir is relative and the working dir is unset (since the default is absolute, using it here seems appropriate since we create the working dir as a side-effect). Finally, to preserve InMemoryFileSystem's init semantics, it seems necessary to set the working dir in initialize(), not after it (clearly a workaround is possible, but at first glance it seems less attractive than the current patch). Have I misread this? > I left it public so the Trash would default to /user/$user instead of the cwd [ ... ]
We should uniformly resolve relative paths to the connected directory, no? > it also leaves open the possibility of applications (like FsShell) creating the default working directory if it doesn't exist or permitting operations like "dfs -mkdirs ." Let's revisit that then. We should err on the conservative side when it comes to adding new public methods. And if we want to later auto-create working directories, we can do that in getDefaultWorkingDir() itself if we want to make that feature FileSystem-specific, or in FileSystem#get() if we want to make it universal. So I don't see that making getDefaultWorkingDir() public is required to implement this. > Further, the Trash is initialized before the FileSystem [...] Things are similar in JobTracker initialization, where we encounter an infinite loop [...] Can you please elaborate on this? What're the methods in the loop? Are there other ways to break it besides having folks use getDefaultWorkingDir() to resolve relative paths? Why are Trash and JobTracker any different from other uses of FileSystem that must resolve relative paths? Autocreating working directors.
This will not work with permissions (hadoop 1298) because the client side may not have permission to create the working dir. Eg. /users is NOT world writable. Throwing an exception when a working dir path is used but does not exist is probably a better solution. > the client side may not have permission to create the working dir.
Then that would throw an exception. Or we could explicitly check for that, as you suggest, if you think that would yield a more user-friendly exception. > This would require that admins to religiously create home dirs (and trash-bins) for all its users. Home dirs, yes, but, if the home dir exists, can't a user create his own trash there on demand? > have the job framework check that the working dir exists before starting the job/task. Do mapred jobs require that the working dir exist? They require that the input exists and is readable, and that the output location is writable. Now that we have permissions these checks could be improved. FileInputFormat#validateInput() could check readability. And OutputFormatBase#checkOutputSpecs() could check that the parent of the output directory is writable. That's probably a separate issue, no? > They require that the input exists and is readable, and that the output location is writable.
I've added HADOOP-2528 for this. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||