|
[
Permlink
| « Hide
]
Owen O'Malley added a comment - 09/Sep/08 10:25 PM
I don't see any value over FileSystem.listStatus. It would just be a redundant method.
I'm thinking as more of a utility method since otherwise, callers have to implement the logic to figure this out. so, for example, to implement this in fuse, I would actually have to write logic in c to decide if a uid/gid combination are allowed to access a file. Similarly, fuse-j-hdfs would have to implement this in java.
and others too would have to do it. Ok, that is a very different patch. If you just want to create a utility function in FileUtil that looks like:
boolean access(Path path, FsAction perm) throws IOException; It might makes sense, but it is probably a one liner. I'm still -1 to implementing the jira as spec'ed. I'm not sure I did a good job describing the functionality. It should do a getFileStatus but then look at the current user/groups and have the logic to decide if the user is allowed to access the file. Really api request should have been
// given the permissions/groups/owner in FileStatus and the mode, output what the passed in user is allowed to do to the file int access(FileStatus f, int mode, String uid, String groups[]); but, as you point out, the client side can implement this itself. But, it does have to loop through the groups and such know the Hadoop permission logic. Thinking about this more,
access Of course https://issues.apache.org/jira/browse/HADOOP-4044 I think this is a valid JIRA.
A client side program should not have to re-implement filesystem logic of determining whether a particular user is allowed to access a path. The alternative of doing a stat and looking at the permission bits is not a general-purpose solution, because different file systems can implement access checks differently (irrespective of the mode bits on the file). . The LocalFileSystem may enforce seteuid whereas HDFS might not care. Moreover, when a file system uses kerberos authentication, the mode bits do not make sense any more. An analogy is that all UNIX file system VFS API has an access() method solely for this reason. The UNIX-OS does not interpret the mode bits, it allows the underlying file system to determine access. What goes wrong with a utility function that goes like:
public static boolean access(FileSystem fs, Path p, FsAction perm) throws IOException { UserGroupInformation user = UserGroupInformation.getCurrentUGI(); FileStatus stat = fs.listStatus(p)[0]; FsAction filePerm = null; if (user.getUserName().equals(stat.getOwner())) { filePerm = stat.getPermission().getUserAction(); } else { for(String group: user.getGroups()) { if (group.equals(state.getGroup())) { filePerm = stat.getPermission().getGroupAction(); break; } if (filePerm == null) { filePerm = stat.getPermission().getOtherAction(); } return filePerm.implies(perm); } What goes wrong? > public static boolean access(FileSystem fs, Path p, FsAction perm)
This doesn't need to be static, but could rather be: public boolean access(Path p, FsAction perm) ... with the same default definition you provide. Then a FileSystem impl could override it if needed. I think the access(..) method for HDFS cannot be implemented in client side since we cannot determine whether the current user is a superuser.
Hi Nicholas, Yes your point is absolutely correct. I think the access() method can be implemented only inside the file system. Each file system can/will implement access checks differently.
Maybe it could be added to the FileStatus:
public class FileStatus { ... public FsAction getUserAccess(); } Thoughts? this is something like:
public class FileStatus { + private FsAction userAccess; + public fsAction getUserAccess() { return userAccess; } + public void setUserAccess(FsAction userAccess) { this.userAccess = userAccess; } Right? And the filesystem API for getFileStatus will set it? +1 I am assuming there are no performance implications of creating this on every getFileStatus which there shouldn't be I wouldn't think. one minor issue is that getFileStatus is abstract, so we need to implement it for both RawLocalFileSystem and DFS. I think it's just those 2. I assume for now we want to give the same semantics to both.
How does this handle the superuser case Nicholas and Dhruba mentioned above?
To me FileSystem.access(path, mode) seems more intuitive. Edit : I might have missed some context. The above can be ignored.
I think it would be nice whenever possible to not invent new api/abstractions, so the POSIX model is nice to keep. Turning my performance note on its head, when one wants to just get access, doing everything else for constructing a FileStatus object is a waste. And also when doing a readdir on a directory with 30,000 files, do I really need the access permissions of every file? In practice because fuse (the first use case) has to support ls --color=auto, it has to cache stat objects and support access being called for every file in a directory after doing a readdir, so having the access result in the stat object is helpful.
Clearly when run by the super-user, they will always have read/write access.
But it is more important to write a good Java API that is internally consistent than slavishly follow a C API that was written 40 years ago. > Clearly when run by the super-user, they will always have read/write access.
hmm.. thats a fundamental change in definition of FileStatus. We had brief discussion of this when permissions were added. This jira seems to add more functionality to FileStatus (very commonly accessed everywhere) just to support access() (relatively rarely used). -1 from me.
Yes, I agree - meant all things being equal. But I do think we need to take some consideration that we're able to efficiently implement posix APIs, which this proposal clearly does The FileStatus already contains all of the information to compute this, with the exception of the superuser name. So in HDFS, if we add the superuser's name to HDFS's FileStatus implementation, then access can be computed entirely on the client from the FileStatus, with no added cost on the server. The default implementation on FileStatus could be like Owen's utility method above. HDFS would override this for its FileStatus implementation to check if the current user is superuser and otherwise return super.getAccess().
+1 for the default implementation, of course. I don't think adding more fields to FileStatus for this reason is a good idea.
To Owen's comment about APIs, the File API has canRead and canWrite and it seems the FileStatus object provides a lot of this functionality so in that sense, owen's proposal is consistent with Java io.
On a side note, I suspect that once we have real authentication, we will also have something like:
public abstract class FileSystem { ... public abstract UserGroupInformation getUserInformation() throws IOException; } because the user for each file system can be different. If we do that, it would make sense to include whether it is a super user in the UGI and the previous suggestion of doing it on the client side would work again. smile So I'd be ok with: public abstract class FileSystem { ... public FsAction access(Path p) throws IOException { ... } } so to check read access it would be: fs.access(path).implies(FsAction.READ); The advantage over a single boolean would be if you wanted to detect read-write vs read-only vs nothing. FsAction perm = fs.access(p); if (perm.implies(FsAction.READ_WRITE)) { // handle read/write case } else if (perm.implies(FsAction.READ)) { // handle read only case } else { // handle no access }
+1 didn't mean the exact File API as that would definitely be very inconvenient. But the general idea that you could implement most of the informational functionality of the File API with a FileStatus object is in my opinion a nice property. I am also ok with the optimization of not including the perms in every FileStatus object. It's less preferable on my side, but is still a good API.
> public FsAction access(Path p) throws IOException { ... }
This requires an RPC per file when processing directories. I think it's better to hang this method off of FileStatus, so that requests to the remote FileSystem can be batched for directories. how does one know whethere the filesystem is enforcing ACLs? should access always return true when it is not ?
hi doug,
is the FileStatus object implementation dependent on the FS implementation? If not, how would individual filesystems implement this? Wondering about the design of Path and FileStatus in that combined they provide the functionality of a java io file object. thanks, pete > is the FileStatus object implementation dependent on the FS implementation?
It certainly can be. A FileStatus instance should always be created by a FileSystem implementation, so it is free to return a subclass of FileStatus. I don't think any FileSystems currently take advantage of this. > Wondering about the design of Path and FileStatus in that combined they provide the functionality of a java io file object. The path is just the file name, while the status contains data about that name. One could cache status data for a path to minimize RPCs. HDFS used to do this. MapReduce got written assuming that certain things were cached (like isDirectory(), getLength(), etc.). But not all FileSystem implementations cached these. In particular S3 did not. So MapReduce performance on S3 bad. Also, once HDFS supported append, the cache would get out of date and it was never invalidated. So we moved to a model where the status is explicitly fetched, with no internal caching of path->status. Thus the API better models the underlying RPCs of both HDFS and S3, and there are fewer performance surprises. Note that this is somewhat analogous with the Unix 'stat' system call. I would guess that 'ls -l' does not call 'stat' for each column, but once for each file. In Unix you want to minimize system calls, and in Hadoop you want to minimize RPCs. Dhruba and I discussed this offline and here's what we would like to propose:
1. Add a private field superuser to FileStatus and ensure it is set in all the FileStatus constructors (and FileStatus still doesn't have any logic pertaining to a specific user - ie its still really a stat object) FsAction access(FileStatus f) throws IOException;
FileSystems can override the access method and no need for a RPC as you are passed in the FileStatus. Since this is computed on the client-side, the user is known to the FS. I think this incorporates all the discussed points. > Add a private field superuser to FileStatus [ ... ]
Is a variable superuser name really a universal concept that we want to add to every FileStatus? I'd still prefer that we add a FileStatus method:
The default definition would be like Owen's above, but HDFS would override it to take account of the superuser. HDFS would return a subclass of FileStatus that included the superuser's name. What's the problem with that? > 1. Add a private field superuser to FileStatus and ensure it is set in all the FileStatus constructors
Looks like not so useful hack to me and might not even work with kerberos etc. Any way I guess there has been enough discussion on this... I could not see need for changing FileStatus for this api alone.
This design accommodates both of these points. If we would like we can have the default implementation not include the notion of a superuser and have hdfs override that. So, then is there agreement on: FsAction FileSystem.access(FileStatus f) throws IOException;
With the default implementation being Owen's code and then we would subclass FileStatus for HDFS to include the superuser?
I'm not sure I understand, are you proposing something else or not to implement this? thanks, pete > are you proposing something else or not to implement this?
I prefer a FileStatus method to a FileSystem method, but that's not a huge deal. > I'm not sure I understand, are you proposing something else or not to implement this
I guess I meant we don't need to change FileStatus to implement the API you proposed. I doubt if extra RPC is a big deal for this. Not that it matters, API wise, I would have just had access(Path), but does not matter now.
FileStatus so we don't have to do an RPC, but if we went with FileStauts implementing it, we would also have to do the RPC. > if we went with FileStauts implementing it, we would also have to do the RPC.
No, that RPC has already been done, the one that returned the FileStatus. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||