Currently on our bigger grids, we have a significant number of files that we aren't sure whether anyone is actually using or not (e.g., /tmp). While I recognize that atime is a huge performance killer, whenever one deals with users who have free rein over their space, it is an incredibly important tool to help maintain the system.
This is especially important given the lack of ACLs. On our larger grids, there are many files scattered all over whose purpose, much less usage pattern, we have no real insight into. All we know is that we didn't put them there. Having an access log would at least tell us whether something is being used. Once users get added to the system, having the user information combined with whether a file was touched will be extremely handy. Operationally, I see this being used by dumping the data at a regular interval into an RDBMS or perhaps even inside HDFS itself. It is then fairly trivial to create tools and form policies around data retention.
I am not too worried about data structure size. But if we have to write a transaction for every file access, that could be a performance killer, do you agree? My proposal was to somehow batch access time updates to the disk (using the access bit). Using the access bit and a cron job that runs every hour could effectively provide a coarse-grained access time.
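The access-bit idea can be sketched roughly as follows. This is a toy model, not HDFS code: the class name AccessBitSketch, the integer file ids, and the sweep() method standing in for the hourly cron job are all assumptions made for illustration.

```java
import java.util.BitSet;

/** Toy model of coarse-grained access times driven by an access bit (hypothetical names). */
class AccessBitSketch {
    private final BitSet accessed;     // one "access bit" per file
    private final long[] coarseAtime;  // last sweep time at which the bit was found set

    AccessBitSketch(int fileCount) {
        accessed = new BitSet(fileCount);
        coarseAtime = new long[fileCount];
    }

    /** Called on every file access: only flips a bit in memory, writes nothing. */
    void onAccess(int fileId) {
        accessed.set(fileId);
    }

    /** Periodic sweep (the hourly job): stamps touched files, clears the bits, returns the count. */
    int sweep(long nowMs) {
        int stamped = 0;
        for (int i = accessed.nextSetBit(0); i >= 0; i = accessed.nextSetBit(i + 1)) {
            coarseAtime[i] = nowMs;
            stamped++;
        }
        accessed.clear();
        return stamped;
    }

    long accessTime(int fileId) {
        return coarseAtime[fileId];
    }
}
```

In this model the per-access cost is a single bit flip, and persistence cost is paid once per sweep; the trade-off is that a reported access time is only accurate to the sweep interval.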
> if we have to write a transaction for every file access, that could be a performance killer, do you agree?
I don't know. Logging file opens should be good enough, right? How much is transaction logging a bottleneck currently? How much worse would this make it? If files average ten or more blocks, and we're reading files not much more often than we're writing them, then the impact might be small. Another option to consider is making this a separate log that's buffered, since its data is not as critical to filesystem function. We could flush the buffer every minute or so, so that when the namenode crashes we'd lose only the last minute of access time updates. Might that be acceptable?
> Another option to consider is making this a separate log that's buffered
This is a pretty good option. It should cost almost nothing to write the access to a buffered log. I suspect that for the use case Allen describes it would be sufficient to discover the outliers, i.e. files and directories that haven't been accessed in months.
I agree that we can record the timestamp when an "open" occurred in namenode memory, mark the inode as dirty, and then allow an async flush daemon to persist these dirty inodes to disk once every period. That should be optimal and light-weight.
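A minimal sketch of the "record in memory, flush periodically" idea discussed above. The class BufferedAccessLog and its methods are hypothetical and only model the buffering; in a real namenode the flush() result would be written to a log file by a background thread, which is not shown here.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical buffer of not-yet-persisted access times, keyed by path. */
class BufferedAccessLog {
    private final ConcurrentHashMap<String, Long> dirty = new ConcurrentHashMap<>();

    /** Called when a file is opened; keeps only the latest timestamp per path. */
    void recordOpen(String path, long timeMs) {
        dirty.merge(path, timeMs, Long::max);
    }

    /** Drains the buffer. A periodic flush task would persist the returned batch. */
    Map<String, Long> flush() {
        Map<String, Long> batch = new TreeMap<>();
        for (String path : dirty.keySet()) {
            Long t = dirty.remove(path);
            if (t != null) {
                batch.put(path, t);
            }
        }
        return batch;
    }
}
```

If the process crashes between flushes, only the entries still in the buffer are lost, which matches the "lose only the last minute of access time updates" trade-off raised earlier in the thread.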
Working on some jobs, I have realized that nearly all the MR jobs somehow create temporary directories/files. For various reasons (program bug, program crash, etc.), these may not be deleted properly. We may add a FileSystem#createTempFile() and attach a timestamp to it. After some period, these can be deleted by batch tasks.
Now that we have permissions and ownerships, such an access log would also need to differentiate between reads, writes, ownership changes, and permission changes. This would be extremely helpful in case forensics need to be performed to track down someone doing Bad Things(tm).
I plan on doing the following:
1. Add a 4 byte field to the in-memory inode to maintain access time. This should not adversely impact the transaction processing rate of the namenode. Other types of transactions (e.g. file creation) will anyway cause the transaction-log-buffer to get synced to disk pretty quickly. This implementation will not distinguish between different kinds of metadata accesses and is primarily targeted to weed out files that are not used for a long, long time.
Most tests seem to indicate that the amount of data synced matters as well (along with the number of syncs). I will be surprised if a benchmark that tests mixed load (say 10% writes, 90% reads) is not impacted.
+1.
This looks good enough. I am wondering if you can also log the username somewhere (either the edit log or the namenode log). We are potentially interested in userids reading specific parts of the file system (and then maintaining last reader information at the directory level). So this may allow us to tail the appropriate log and pick such information up. Doesn't
I agree that
I am interested in making some form of archival store in HDFS. Files that are not used for a long time can automatically be moved to slower and/or denser storage. Given the rate at which a cluster size increases, and given the fact that the cost to store data for infinitely long time is very low, it makes sense for the file system to make intelligent storage decisions based on how/when data was accessed. This argues for "access time" to be stored in the file system itself.
I think this proposal is in the right direction.
According to Which means that if we let every open / getBlockLocation be logged and flushed, we lose big. Another observation is that map-reduce does a lot of ls operations, both for directories and for individual files. I have seen 20,000 per second. This is done when the job starts and depends on the user input data and on how many tasks the job should be running. So maybe we should not log file access for ls, permission checking, etc. I think it would be sufficient to write OP_SET_ACCESSTIME only in case of getBlockLocations(). Also, I think we should not support access time for directories, only for regular files. Another alternative would be to keep the access time only in the name-node memory. Would that be sufficient enough to detect "malicious" It would be good to have some experimental data measuring throughput and latency for getBlockLocation with and without ACCESSTIME
I agree with Konstantin on most counts. I will probably implement access times for files and directories for getblocklocations RPC only and then run NNThroughputBenchmark to determine its impact.
Did you mean files only, directories don't have getBlockLocations()?
Yes, Konstantin, I meant "access time for files for the getblocklocations RPC only".
Regarding Joydeep's requirement about recording the user-name of last access, I agree with Raghu that it is more likely a case for If and only if increase in EditsLog data noticeably affects performance (which I suspect it will) : Couple of options :
Hi Raghu, the idea you propose, "update access times once every 24 hours", sounds good. However, how will the namenode remember which inodes are dirty and need to be flushed? It can keep a data structure to record all inodes that have "dirty" access times, but it needs memory. Another option would be to "walk" the entire tree looking for "dirty" inodes. Both approaches are not very appealing. Do you have any other options?
> However, how will the namenode remember which inodes are dirty and need to be flushed?
It does not need any more structures than what you have proposed ("Add a 4 byte field to the in-memory inode to maintain access time."). So at each access, you check if this 4-byte field is older than the "accuracy setting", and if so you add an EditLog entry. If you want to keep the in-memory value accurate (which I don't think is required), then you need to add 4 more bytes to record the "last logged time". Right?
Until we get volumes or the equivalent, it would be good to have the accuracy setting take a path. For example, I might want to have a more accurate setting for data outside a user's home directory.
Just to clarify once again, the first option above does not require any more memory or tree traversals. It just reduces the number of entries to EditsLog. It's actually just a tweak.
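The check described above can be sketched as follows. This is an illustration only: SketchInode, AccessTimeSketch, and the string-based edit log are hypothetical stand-ins, not the namenode's actual classes, and times are kept as long milliseconds rather than a 4-byte field for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an in-memory inode. */
class SketchInode {
    final String path;
    long accessTime; // last logged access time, in milliseconds

    SketchInode(String path, long accessTime) {
        this.path = path;
        this.accessTime = accessTime;
    }
}

/** Sketch of access-time logging limited by an accuracy (precision) setting. */
class AccessTimeSketch {
    private final long precisionMs;                 // the "accuracy setting"
    final List<String> editLog = new ArrayList<>(); // stand-in for the EditsLog

    AccessTimeSketch(long precisionMs) {
        this.precisionMs = precisionMs;
    }

    /** Called on each access; returns true if an edit-log entry was written. */
    boolean recordAccess(SketchInode inode, long nowMs) {
        if (nowMs - inode.accessTime < precisionMs) {
            return false; // stored value is recent enough: no update, no log entry
        }
        inode.accessTime = nowMs;
        editLog.add("SET_ACCESSTIME " + inode.path + " " + nowMs);
        return true;
    }
}
```

Because the in-memory value is only changed when an entry is logged, no separate dirty list or tree walk is needed: every value held in memory has already been written to the log.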
Raghu, just to make sure I understand right: let's say the accuracy setting is 24 hours. Now, suppose I read the contents of a file /tmp/foo now at 1PM. The in-memory inode is updated with the accesstime of 1PM. But it is not recorded in the transaction log. Let's assume that no other files are accessed in the file system for the entire next day.
When it is 1 PM tomorrow, the system has to remember that /tmp/foo needs to be flushed. How does this occur? How does hdfs find out that the inode /tmp/foo is dirty and has to be flushed to the transaction log?
> The in-memory inode is updated with the accesstime of 1PM. But it is not recorded in the transaction log.
It is recorded in the transaction log at this time (assuming it was not accessed in 24 hours prior to that).
> When it is 1 PM tomorrow, the system has to remember that /tmp/foo needs to be flushed. How does this occur? [...]
It does not need to remember, since the transaction was written at 1 PM previous day. I am trying to see if I am missing something here. Note that effect of not sync-ing the editslog file for each access is same as before. IOW, a last access time of 't' returned by NameNode for file implies "this file was last accessed during [t, t+24h)".
> IOW, a last access time of 't' returned by NameNode for file implies "this file was last accessed during [t, t+24h)".
Basically, it is recording "access date" for the case above. BTW, is it expensive to invoke System.currentTimeMillis()? I don't have any idea.
I got it Raghu. Thanks for the tip. I like your proposal. +1.
This patch does the following:
1. Implements access time for files. Directories do not have access times.
> 2. The access times of files is precise upto an hour boundary
Can this be made configurable? If not now, I am 100% certain it will be made configurable in near future. I don't see any advantage to not making this a config variable (but probably there is one). In fact we might have a feature request to make it runtime-configurable.. but not required in this jira.
A few preliminary comments.
Incorporated most review comments. I do not update the in-memory access time every time. The in-memory access time is in sync with the value persisted on disk. Otherwise, the access time of a file could move back in time when a namenode restarts!
I also ran benchmarks with NNThroughputBenchmark. All benchmarks remain at practically the same performance. In particular, the "open benchmark with 300 threads and 100K files" is as follows: patch trunk
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12388838/accessTime4.patch against trunk revision 689230.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
-1 javadoc. The javadoc tool appears to have generated 1 warning messages.
-1 javac. The applied patch generated 484 javac compiler warnings (more than the trunk's current 480 warnings).
-1 findbugs. The patch appears to introduce 1 new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3112/testReport/
This message is automatically generated.
Address review comments.
Incorporated all of Konstantin's review comments other than the one that said that setAccessTime() call should be removed.
The setAccessTime() call is a utility method that allows an application to set the access time of a file without having to "open" the file. The permission-access-checks are precisely the same as that for "opening" a file, so there isn't any security concern IMO. Most file systems support setting access times/modification times on a file, see http://linux.die.net/man/2/utimes
I purposely did not add a command to the FsShell to display access times. An application can fetch the accessTime using a programmatic API FileSystem.getFileStatus().
Mistakenly closed issue, re-opening.
I don't understand. There must be some secret use case that you don't want to talk about or something. I have all those questions
I guess I am saying I am ok with a touchAC() method (as in touch -ac), but it is already there, called getBlockLocations(), and I don't see why you need more.
I can see where the capability to set access time would be extremely useful for the FUSE case. (and the NFSv4 proxy case, should that come to pass)
+1 it would be nice as "touch /export/hdfs/foo" seems to be a common way of checking fuse-dfs and the returned IO Error is confusing since the module is working but just can't implement touch. I guess it could be a no op but when access time is configured, it would be nice to have.
> it would be nice as "touch /export/hdfs/foo"
Isn't touch just 'append(f); close(f)'? That's what command line touch seems to do. Edit: I am just talking about touch. No strong opinion w.r.t. setModificationTime() and setAccessTime().
I dunno about touch, but I was thinking of the tar, etc. case where atime can be restored as part of the unpack operation.
in fuse-dfs, this is the error I get:
touch /export/hdfs/user/pwyckoff/foo
touch: setting times of `/export/hdfs/user/pwyckoff/foo': Function not implemented
So, I assumed one needs to implement some attribute setting function. But this is against 0.17, so appends also give me an IO Error.
touch() is getBlockLocations() in terms of hdfs.
Preserving file times while copying or tar-ing is useful, no doubt about it, but is rather different from being able to set it to an arbitrary value. setAccessTime() gives more than what is needed.
Hi Konstantin, I am not fixated on providing the setAccessTime method. However, I think it is a powerful feature that can be used by any hierarchical storage system. It even has precedent on all Unix and Windows systems:
http://linux.die.net/man/2/utimes
If more people feel strongly about not having the setAccessTime API, I will remove it from the patch.
+1 for setAccessTime() in HDFS.
-1 for adding it to FileSystem API. As we move towards stable APIs we really need to be more cautious about adding new APIs. Once there is enough utility demonstrated by this API in HDFS, it can move up (perhaps in a different form). regd touch, its main function is to change modification time (and create the file if it does not exist).
> regd touch, its main function is to change modification time
i.e. getBlockLocations() does not seem sufficient or correct.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12389045/accessTime5.patch against trunk revision 689666.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 10 new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3132/testReport/
This message is automatically generated.
This is what I get from the discussion above.
In the spirit of minimizing impact on external APIs I propose to
We can always introduce setAccessTime() when it is really necessary, because once introduced it's hard to take it back.
+1 to not unduly adding stuff to the API, but when the thing in mind is something needed to support a posix API, shouldn't there be an exception? We already know, as Allen points out, that setAccessTime is needed to implement tar properly, so isn't that an important enough application?
I still vote for having FileSystem.setAccessTime(). It is very much like Posix and I do not see a security issue anywhere at all. Also, most archiving/unarchiving utilities set the access time. For example, a restore utility will need to set the access time of the file.
As I said before, setAccessTime has wider semantics than is required for tar.
setAccessTime lets you set an arbitrary value to aTime, while in tar you just need to replicate the time of an existing file. On the other hand setting only access time is not sufficient for tar. You also want to keep the same modification time, owner, group and permissions. So to me it makes more sense to introduce a new create method with FileStatus of the existing file as a parameter or a copy method with an option to replicate FileStatus fields. Posix is not an argument in connection with hdfs.
I always find it hard to argue with my kind when after all pros and cons they say "I still want it". What restore utility?
If we want to provide POSIX-like functionality whenever possible (which I think is a good idea), it would make more sense to keep the names/parameters etc. similar as well. Though not part of this jira, I am +1 for FileSystem.utimes() and -1 for FileSystem.setAccessTime().
Sorry, I created confusion. When I talked about touch() I meant changing access time to current, but Raghu just corrected me that touch in posix can also change modification time. Just to clarify, I did not want to say anything about modification time; everything I talked about was access times. Precisely, I meant
touch -ac foo
I'll replace touch() with touchAC() in my earlier posts to clarify that, if anybody is curious enough to read it.
Maybe this is a dumb question but how will hadoop archives (htars?) interact with access times?
Also, at least to me, an API called touchAC() seems very non-obvious as to its purpose. (esp if I'm doing this on Windows)
yes, I don't think we should have a touchAC()... touch would be a utility that could be implemented with other FileSystem API.
I like Raghu's proposal that FileSystem.setAccessTime() can be renamed as FileSystem.utimes(FileStatus). But it creates some other issues:
1. The FileStatus object has blockSize of the file. The blockSize cannot be changed. Similarly, the FileStatus object has a field called 'isdir'. What happens to this one?
2. Similarly, the FileStatus has the length of the file. Are we going to truncate the file (or create a sparse file with holes if the user sets a longer length)?
3. There are existing APIs FileSystem.setReplication(), FileSystem.setOwner(), setGroup(), setPermissions(), etc. Will these be deprecated or coexist with the new API?
I prefer adding a setAccessTime because it allows an application to set the access time to an arbitrary value. If we want to merge all the above APIs into FileSystem.utimes(), I can do it as part of a separate JIRA. Raghu, Konstantin: does it sound ok?
I don't know how FileStatus, blockSize etc. matter. I would have thought it would be something like FileSystem.utimes(path, modTime, aTime), keeping the behaviour as close as possible to the posix man page, just like the FileSystem.setAccessTime() interface you added.
I suggested the name utimes() since you gave utimes() as the justification. If you think setAccessTime() should be the name, that's alright I guess. The API stuff could better be done in a different jira IMHO. Edit : minor
Hi Raghu, Thanks. I like your proposal of having FileSystem.utimes(path, modTime, aTime). I can do this as part of this JIRA. Further cleanup of the API can be done as part of another JIRA. Do we have consensus now?
+1 for me.. in this jira or a different one, does not matter. sticking to posix-like when possible helps since that interface has already gone through the arguments..
Actually I wanted to edit my comment to make it clear that I am not too opposed to setAccessTime()...
> setAccessTime because it allows an application to set the access time to an arbitrary value.
This is exactly the reason why I am against introducing setAccessTime. No arbitrary value to access time. And therefore no utimes() for me.
Ok, from Raghu's and Konstantin's comments, I guess nobody is stuck on what the API looks like. Whether it is setAccessTime() or utimes(), everybody is ok with it. It appears that both Raghu and Konstantin are +1 on this one. Please let me know if this is not the case.
The point that is being discussed is whether utimes/setAccessTime allows setting the time to any user-specified value or whether it sets it to the current time on the namenode. I still vote for allowing a user to set any access time... this is what POSIX does and it allows restore utilities to use a standard API. From Raghu's comments, it appears to me that he is +1 on it too.
> So to me it makes more sense to introduce a new create method with FileStatus of the existing file as a parameter or a
I do not like the idea of having a custom API as described above.
Clarifying.
> setAccessTime() or utimes(), everybody is ok with it. It appears that both Raghu and Konstantin are +1 on this one.
-1 on both.
> I still vote for allowing a user to set any access time.
Seriously, it's like nobody is listening to others. Maybe we need a meeting.
Just to add noise to the fire, I'm +1 to setAccessTime. I also think it is a very good idea to be able to configure it off at the namenode. My case for setAccessTime is that if you expand an archive or do distcp, it is really nice to be able to optionally set all of the times to match the copied files. That includes access time.
I think Konstantin's point is that the only "realistic" use case we can come up with for setting an arbitrary access time is during a file create (such as expansion of an archive and a distcp), therefore the API should reflect that use case since it keeps the number of routines small. Please correct me if I'm wrong.
To me, that sounds incredibly unclean. It would be better to allow for a separate utimes()-equivalent API so that a) if there is a use case later, it can be covered and b) do we really want a call that is two operations-in-one?
> a file create (such as expansion of an archive and a distcp), [...]
During create is not enough even for these use cases. Say distcp copies a 10GB file and sets the mod time at create time (to t - 1month), and the last block is written 1 min later. Then the mod time after distcp will be (t + 1min) rather than (t - 1month). -1 for extra options to create, or close, etc. Why not just provide utimes(), since we are using POSIX as a tie breaker? Another of Konstantin's points is that the FS should not allow setting a future time, which sounds ok, but it is just a file attribute to help users, not something the filesystem inherently depends upon. I don't see the need to police it that much, and since POSIX is a tie breaker we could just stick to its functionality. Note that in all the use cases we need to be able to set modtime too.
Given that Raghu, Owen and Allen commented that it is better to follow the POSIX semantics of allowing a user to set either access time or modification time to any arbitrary value he/she likes, I changed my earlier patch slightly to add the following API:
This is precisely similar to the POSIX utimes call, but follows the Hadoop naming pattern for method names. This allows setting access time or modification time or both. Submitting patch for HadoopQA tests.
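The distcp argument made earlier in the thread, and the shape of a utimes-style call, can be illustrated with a toy model. Nothing here is Hadoop code: ToyFile and DistcpTimesDemo are hypothetical, the (mtime, atime) parameter order follows the utimes(path, modTime, aTime) form suggested in the discussion, and treating -1 as "leave this field unchanged" is an assumption of this sketch.

```java
/** Toy file with only the two timestamps under discussion (hypothetical). */
class ToyFile {
    long mtime;
    long atime;

    /** Writing a block bumps the modification time, as a file system normally would. */
    void writeBlock(long nowMs) {
        mtime = nowMs;
    }

    /** utimes-style setter; -1 leaves the corresponding field unchanged (assumed convention). */
    void setTimes(long newMtime, long newAtime) {
        if (newMtime != -1) {
            mtime = newMtime;
        }
        if (newAtime != -1) {
            atime = newAtime;
        }
    }
}

class DistcpTimesDemo {
    public static void main(String[] args) {
        long minute = 60_000L;
        long t = 1_000_000_000_000L;                   // moment the copy starts
        long sourceMtime = t - 30L * 24 * 60 * minute; // "t - 1 month" on the source file

        ToyFile copy = new ToyFile();
        copy.setTimes(sourceMtime, -1);  // times applied at create
        copy.writeBlock(t + minute);     // last block lands a minute later
        System.out.println(copy.mtime == t + minute);  // prints true: the create-time value was overwritten

        copy.setTimes(sourceMtime, -1);  // separate call once the copy has finished
        System.out.println(copy.mtime == sourceMtime); // prints true
    }
}
```

The point of the model is only that a setter which can be called after the last write preserves the source time, while a value supplied at create is overwritten by later block writes.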
What are the permissions required for setting arbitrary accessTime? just read permission does not seem enough at least on Linux box.
From my understanding, (and this is what I have implemented in this patch), a read access is required to be able to set access time on files. A write-access is required to be able to set modification time on files.
That is mostly not correct.. maybe it needs to be changed later.
Another option would be to allow changing access times and modification times only by the owner of the file and the superuser. But this patch does not do this. This patch implements: "a read access is required to be able to set access time on files. A write-access is required to be able to set modification time on files".
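The rule the patch is described as implementing can be written down as a small sketch. The class and enum names are hypothetical, and using -1 to mean "do not set this field" is an assumption of the sketch rather than something stated in this thread.

```java
import java.util.EnumSet;

/** Sketch of the permission rule described above (hypothetical names). */
class SetTimesPermissionSketch {
    enum Access { READ, WRITE }

    /** Returns the access rights a caller would need for the given request. */
    static EnumSet<Access> requiredAccess(long mtime, long atime) {
        EnumSet<Access> needed = EnumSet.noneOf(Access.class);
        if (atime != -1) {
            needed.add(Access.READ);  // setting access time requires read access
        }
        if (mtime != -1) {
            needed.add(Access.WRITE); // setting modification time requires write access
        }
        return needed;
    }
}
```

The owner-or-superuser alternative mentioned above would replace this mapping with an ownership check; it is not modeled here.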
The main use cases are distcp, restore (or untar).
Konstantin raises 2 good points:
While the extended create operation works for our use case, there are few advantages to the utimes() approach:
Hence I am in favour of: Edit : its fine. getBlockLocations calls internal dir.setTimes().
Regd setTimes() implementation: We should have a private setTimes that does not do security checks and audit logging, since the most common use is internal (as in getBlockLocations()). Security checks and logging are needed only when a user actively invokes setTimes(). btw, should it be setUTimes()? I haven't looked at the rest of the patch thoroughly.
Hi Raghu, thanks for reviewing this patch. The current patch does not do any additional security checks or audit logging while setting access times when invoked from getBlockLocations. In the case when FileSystem.setTimes() is called, it checks access privileges and does audit logging. So, it behaves precisely the way you described in your comment. Please let me know if you have any additional comments.
Hadoop QA did not pick up this patch for tests. Resubmitting....
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12389459/accessTime6.patch against trunk revision 692287.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 4 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 10 new Findbugs warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3181/testReport/
This message is automatically generated.
I created
The failed test is TestKosmosFileSystem and it has been failing for the last 5 builds. This failure is not part of this patch. The findbugs warnings are not introduced by this patch. I believe that the test-patch process is getting confused while diffing the findbugs output on trunk with the findbugs output from this patch. Konstantin has reviewed this patch earlier. Please let me know if somebody else wants to review this patch. I would like to get it committed by Friday Sept 5 so that it can make it into the 0.19 release.
Dhruba, I looked at setTimes() mainly; it looks good. Since the rest of the patch hasn't changed, you can commit it.
Thanks Raghu. I will commit this patch.
Integrated in Hadoop-trunk #595 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/595/)
Also, in HDFS, would access times really be that expensive? We have relatively few files and relatively many blocks. So increasing the data structure size of a file shouldn't be that costly. The larger expense might be logging each time a file is opened. How bad would that be? Perhaps we could make it optional?
I'm just brainstorming...