Issue Details (XML | Word | Printable)

Key: HADOOP-4108
Type: New Feature New Feature
Status: Reopened Reopened
Priority: Major Major
Assignee: Pete Wyckoff
Reporter: Pete Wyckoff
Votes: 0
Watchers: 8
Available Workflow Actions

Submit Patch
Operations

If you were logged in you would be able to see more operations.
Hadoop Common

FileSystem support for POSIX access method

Created: 08/Sep/08 04:00 AM   Updated: 16/Sep/08 11:31 PM
Return to search
Component/s: fs
Affects Version/s: None
Fix Version/s: None

Time Tracking:
Not Specified

Issue Links:
Blocker
 
Dependants
 


 Description  « Hide
From man access:
int access(const char *pathname, int mode);

DESCRIPTION
access checks whether the process would be allowed to read, write or test for existence of the file (or other file system object) whose name is pathname. If pathname is a symbolic link permissions of the file referred to by this symbolic link are tested.

mode is a mask consisting of one or more of R_OK, W_OK, X_OK and F_OK.

R_OK, W_OK and X_OK request checking whether the file exists and has read, write and execute permissions, respectively. F_OK just requests checking for the existence of the file.

The tests depend on the permissions of the directories occurring in the path to the file, as given in pathname, and on the permissions of directories and files referred to by symbolic links encountered on the way.

The check is done with the processâs real uid and gid, rather than with the effective ids as is done when actually attempting an operation. This is to allow set-UID programs to
easily determine the invoking userâs authority.

Only access bits are checked, not the file type or contents. Therefore, if a directory is found to be "writable," it probably means that files can be created in the directory,
and not that the directory can be written as a file. Similarly, a DOS file may be found to be "executable," but the execve(2) call will still fail.

If the process has appropriate privileges, an implementation may indicate success for X_OK even if none of the execute file permission bits are set.

RETURN VALUE
On success (all requested permissions granted), zero is returned. On error (at least one bit in mode asked for a permission that is denied, or some other error occurred), -1 is
returned, and errno is set appropriately.



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Owen O'Malley added a comment - 09/Sep/08 10:25 PM
I don't see any value over FileSystem.listStatus. It would just be a redundant method.

Pete Wyckoff added a comment - 09/Sep/08 10:31 PM
I'm thinking as more of a utility method since otherwise, callers have to implement the logic to figure this out. so, for example, to implement this in fuse, I would actually have to write logic in c to decide if a uid/gid combination are allowed to access a file. Similarly, fuse-j-hdfs would have to implement this in java.

and others too would have to do it.


Owen O'Malley added a comment - 09/Sep/08 11:03 PM
Ok, that is a very different patch. If you just want to create a utility function in FileUtil that looks like:
boolean access(Path path, FsAction perm) throws IOException;

It might makes sense, but it is probably a one liner. I'm still -1 to implementing the jira as spec'ed.


Pete Wyckoff added a comment - 09/Sep/08 11:07 PM
I'm not sure I did a good job describing the functionality. It should do a getFileStatus but then look at the current user/groups and have the logic to decide if the user is allowed to access the file. Really api request should have been
// given the permissions/groups/owner in FileStatus and the mode, output what the passed in user is allowed to do to the file
int access(FileStatus f, int mode, String uid, String groups[]);

but, as you point out, the client side can implement this itself. But, it does have to loop through the groups and such know the Hadoop permission logic.


Pete Wyckoff added a comment - 10/Sep/08 07:14 AM
Thinking about this more,
access
is filesystem specific (hdfs, s3, kosmix, ext3 may all have different rules). a filesystem may or may not respect setuid bits, need to access the entire Path hierarchy and in addition, need to follow symlinks which is not possible to do efficiently from the client side.

Of course https://issues.apache.org/jira/browse/HADOOP-4044 talk about needing the client to sometimes resolve an external link complicates this.


dhruba borthakur added a comment - 10/Sep/08 09:03 PM
I think this is a valid JIRA.

A client side program should not have to re-implement filesystem logic of determining whether a particular user is allowed to access a path. The alternative of doing a stat and looking at the permission bits is not a general-purpose solution, because different file systems can implement access checks differently (irrespective of the mode bits on the file). . The LocalFileSystem may enforce seteuid whereas HDFS might not care. Moreover, when a file system uses kerberos authentication, the mode bits do not make sense any more.

An analogy is that all UNIX file system VFS API has an access() method solely for this reason. The UNIX-OS does not interpret the mode bits, it allows the underlying file system to determine access.


Owen O'Malley added a comment - 10/Sep/08 09:52 PM
What goes wrong with a utility function that goes like:
public static boolean access(FileSystem fs, Path p, FsAction perm) throws IOException {
  UserGroupInformation user = UserGroupInformation.getCurrentUGI();
  FileStatus stat = fs.listStatus(p)[0];
  FsAction filePerm = null;
  if (user.getUserName().equals(stat.getOwner())) {
    filePerm = stat.getPermission().getUserAction();
  } else {
    for(String group:  user.getGroups()) {
      if (group.equals(state.getGroup())) {
        filePerm = stat.getPermission().getGroupAction();
        break;
      }
  if (filePerm == null) {
    filePerm = stat.getPermission().getOtherAction();
  }
  return filePerm.implies(perm);
}

What goes wrong?


Doug Cutting added a comment - 10/Sep/08 10:21 PM
> public static boolean access(FileSystem fs, Path p, FsAction perm)

This doesn't need to be static, but could rather be:

public boolean access(Path p, FsAction perm) ...

with the same default definition you provide. Then a FileSystem impl could override it if needed.


Tsz Wo (Nicholas), SZE added a comment - 11/Sep/08 12:23 AM
I think the access(..) method for HDFS cannot be implemented in client side since we cannot determine whether the current user is a superuser.

dhruba borthakur added a comment - 11/Sep/08 06:01 AM
Hi Nicholas, Yes your point is absolutely correct. I think the access() method can be implemented only inside the file system. Each file system can/will implement access checks differently.

Owen O'Malley added a comment - 11/Sep/08 06:37 AM
Maybe it could be added to the FileStatus:
public class FileStatus {
  ...
  public FsAction getUserAccess();
}

Thoughts?


dhruba borthakur added a comment - 11/Sep/08 05:46 PM
+1 to Owen's proposal.

Pete Wyckoff added a comment - 11/Sep/08 05:57 PM
this is something like:
public class FileStatus {
  + private FsAction userAccess;
  + public fsAction getUserAccess() { return userAccess; }
  + public void setUserAccess(FsAction userAccess) { this.userAccess = userAccess; }

Right? And the filesystem API for getFileStatus will set it? +1 I am assuming there are no performance implications of creating this on every getFileStatus which there shouldn't be I wouldn't think.


Pete Wyckoff added a comment - 11/Sep/08 06:08 PM
one minor issue is that getFileStatus is abstract, so we need to implement it for both RawLocalFileSystem and DFS. I think it's just those 2. I assume for now we want to give the same semantics to both.

Raghu Angadi added a comment - 11/Sep/08 06:47 PM - edited
How does this handle the superuser case Nicholas and Dhruba mentioned above?

To me FileSystem.access(path, mode) seems more intuitive.

Edit : I might have missed some context. The above can be ignored.


Pete Wyckoff added a comment - 11/Sep/08 07:04 PM

To me FileSystem.access(path, mode) seems more intuitive.

I think it would be nice whenever possible to not invent new api/abstractions, so the POSIX model is nice to keep.

Turning my performance note on its head, when one wants to just get access, doing everything else for constructing a FileStatus object is a waste. And also when doing a readdir on a directory with 30,000 files, do I really need the access permissions of every file?

In practice because fuse (the first use case) has to support ls --color=auto, it has to cache stat objects and support access being called for every file in a directory after doing a readdir, so having the access result in the stat object is helpful.


Owen O'Malley added a comment - 11/Sep/08 08:35 PM

How does this handle the superuser case Nicholas and Dhruba mentioned above?

Clearly when run by the super-user, they will always have read/write access.

I think it would be nice whenever possible to not invent new api/abstractions, so the POSIX model is nice to keep.

But it is more important to write a good Java API that is internally consistent than slavishly follow a C API that was written 40 years ago.


Raghu Angadi added a comment - 11/Sep/08 08:45 PM
> Clearly when run by the super-user, they will always have read/write access.

hmm.. thats a fundamental change in definition of FileStatus. We had brief discussion of this when permissions were added.
FileStatus is for the property of a file not related to a user. But this jira seems to change that to properties that effectively apply to the user requesting it. I am not sure if that is a good direction.

This jira seems to add more functionality to FileStatus (very commonly accessed everywhere) just to support access() (relatively rarely used).

-1 from me.


Pete Wyckoff added a comment - 11/Sep/08 09:04 PM

But it is more important to write a good Java API that is internally consistent than slavishly follow a C API that was written 40 years ago.

Yes, I agree - meant all things being equal. But I do think we need to take some consideration that we're able to efficiently implement posix APIs, which this proposal clearly does


Doug Cutting added a comment - 11/Sep/08 09:09 PM
The FileStatus already contains all of the information to compute this, with the exception of the superuser name. So in HDFS, if we add the superuser's name to HDFS's FileStatus implementation, then access can be computed entirely on the client from the FileStatus, with no added cost on the server. The default implementation on FileStatus could be like Owen's utility method above. HDFS would override this for its FileStatus implementation to check if the current user is superuser and otherwise return super.getAccess().

Raghu Angadi added a comment - 11/Sep/08 09:13 PM
+1 for the default implementation, of course. I don't think adding more fields to FileStatus for this reason is a good idea.

Pete Wyckoff added a comment - 11/Sep/08 09:22 PM
To Owen's comment about APIs, the File API has canRead and canWrite and it seems the FileStatus object provides a lot of this functionality so in that sense, owen's proposal is consistent with Java io.

Owen O'Malley added a comment - 11/Sep/08 09:29 PM
On a side note, I suspect that once we have real authentication, we will also have something like:
public abstract class FileSystem {
...
public abstract UserGroupInformation getUserInformation() throws IOException;
}

because the user for each file system can be different. If we do that, it would make sense to include whether it is a super user in the UGI and the previous suggestion of doing it on the client side would work again. smile So I'd be ok with:

public abstract class FileSystem {
...
public FsAction access(Path p) throws IOException { ... }
}

so to check read access it would be:

fs.access(path).implies(FsAction.READ);

The advantage over a single boolean would be if you wanted to detect read-write vs read-only vs nothing.

FsAction perm = fs.access(p);
if (perm.implies(FsAction.READ_WRITE)) {
  // handle read/write case
} else if (perm.implies(FsAction.READ)) {
  // handle read only case
} else {
  // handle no access
}

Pete Wyckoff added a comment - 11/Sep/08 09:40 PM

The advantage over a single boolean would be if you wanted to detect read-write vs read-only vs nothing.

+1 didn't mean the exact File API as that would definitely be very inconvenient. But the general idea that you could implement most of the informational functionality of the File API with a FileStatus object is in my opinion a nice property.


Pete Wyckoff added a comment - 11/Sep/08 09:44 PM
I am also ok with the optimization of not including the perms in every FileStatus object. It's less preferable on my side, but is still a good API.

Doug Cutting added a comment - 11/Sep/08 10:31 PM
> public FsAction access(Path p) throws IOException { ... }

This requires an RPC per file when processing directories. I think it's better to hang this method off of FileStatus, so that requests to the remote FileSystem can be batched for directories.


Pete Wyckoff added a comment - 12/Sep/08 11:19 PM
how does one know whethere the filesystem is enforcing ACLs? should access always return true when it is not ?

Pete Wyckoff added a comment - 15/Sep/08 03:06 AM
hi doug,

is the FileStatus object implementation dependent on the FS implementation? If not, how would individual filesystems implement this?

Wondering about the design of Path and FileStatus in that combined they provide the functionality of a java io file object.

thanks, pete


Doug Cutting added a comment - 15/Sep/08 06:21 PM
> is the FileStatus object implementation dependent on the FS implementation?

It certainly can be. A FileStatus instance should always be created by a FileSystem implementation, so it is free to return a subclass of FileStatus. I don't think any FileSystems currently take advantage of this.

> Wondering about the design of Path and FileStatus in that combined they provide the functionality of a java io file object.

The path is just the file name, while the status contains data about that name. One could cache status data for a path to minimize RPCs. HDFS used to do this. MapReduce got written assuming that certain things were cached (like isDirectory(), getLength(), etc.). But not all FileSystem implementations cached these. In particular S3 did not. So MapReduce performance on S3 bad. Also, once HDFS supported append, the cache would get out of date and it was never invalidated. So we moved to a model where the status is explicitly fetched, with no internal caching of path->status. Thus the API better models the underlying RPCs of both HDFS and S3, and there are fewer performance surprises.

Note that this is somewhat analogous with the Unix 'stat' system call. I would guess that 'ls -l' does not call 'stat' for each column, but once for each file. In Unix you want to minimize system calls, and in Hadoop you want to minimize RPCs.


Pete Wyckoff added a comment - 16/Sep/08 09:14 PM - edited
Dhruba and I discussed this offline and here's what we would like to propose:

1. Add a private field superuser to FileStatus and ensure it is set in all the FileStatus constructors (and FileStatus still doesn't have any logic pertaining to a specific user - ie its still really a stat object)
2. Add the utility API:

FsAction access(FileStatus f) throws IOException;
to FileSystem using the code Owen added to the JIRA.

FileSystems can override the access method and no need for a RPC as you are passed in the FileStatus. Since this is computed on the client-side, the user is known to the FS.

I think this incorporates all the discussed points.


Doug Cutting added a comment - 16/Sep/08 09:30 PM
> Add a private field superuser to FileStatus [ ... ]

Is a variable superuser name really a universal concept that we want to add to every FileStatus?

I'd still prefer that we add a FileStatus method:

  • FsAction getAccess() throws IOException;

The default definition would be like Owen's above, but HDFS would override it to take account of the superuser. HDFS would return a subclass of FileStatus that included the superuser's name. What's the problem with that?


Raghu Angadi added a comment - 16/Sep/08 09:34 PM
> 1. Add a private field superuser to FileStatus and ensure it is set in all the FileStatus constructors

Looks like not so useful hack to me and might not even work with kerberos etc. Any way I guess there has been enough discussion on this... I could not see need for changing FileStatus for this api alone.


Pete Wyckoff added a comment - 16/Sep/08 10:21 PM

Is a variable superuser name really a universal concept that we want to add to every FileStatus?

This design accommodates both of these points. If we would like we can have the default implementation not include the notion of a superuser and have hdfs override that.

So, then is there agreement on:

FsAction FileSystem.access(FileStatus f) throws IOException;

With the default implementation being Owen's code and then we would subclass FileStatus for HDFS to include the superuser?

Looks like not so useful hack to me and might not even work with kerberos etc. Any way I guess there has been enough discussion on this... I could not see need for changing FileStatus for this api alone.

I'm not sure I understand, are you proposing something else or not to implement this?

thanks, pete


Doug Cutting added a comment - 16/Sep/08 10:31 PM
> are you proposing something else or not to implement this?

I prefer a FileStatus method to a FileSystem method, but that's not a huge deal.


Raghu Angadi added a comment - 16/Sep/08 10:39 PM
> I'm not sure I understand, are you proposing something else or not to implement this

I guess I meant we don't need to change FileStatus to implement the API you proposed. I doubt if extra RPC is a big deal for this.

Not that it matters, API wise, I would have just had access(Path), but does not matter now.


Pete Wyckoff added a comment - 16/Sep/08 11:02 PM

Not that it matters, API wise, I would have just had access(Path), but does not matter now.

FileStatus so we don't have to do an RPC, but if we went with FileStauts implementing it, we would also have to do the RPC.


Doug Cutting added a comment - 16/Sep/08 11:31 PM
> if we went with FileStauts implementing it, we would also have to do the RPC.

No, that RPC has already been done, the one that returned the FileStatus.