Uploaded a first draft of a design doc (followed the template in HADOOP-5587). Content follows. Lemme know where I missed the boat. Thanks, Eli
Symlinks Design Doc
HDFS path resolution has the following limitations:
- Files and directories are only accessible via a single path.
- An HDFS namespace may not span multiple file systems.
Symbolic links address these limitations by providing an additional level of indirection when resolving paths in an HDFS file system. A symbolic link is a special type of file that contains a path to another file or directory. Paths may be relative or absolute. Relative paths (eg ../user) provide an alternate path to a single file or directory in the file system. An absolute path may be relative to the current file system (eg /user) or specify a URL (eg hdfs://localhost:8020/foo) which allows the link to point to any file or directory irrespective of the source and destination file systems.
Allowing multiple paths to resolve to the same file or directory, and HDFS namespaces to span multiple file systems, makes it easier to access files and manage their underlying storage.
If an application requires data be available by a particular path, a symlink may be used in lieu of copying the data from its current location. For example, a user may want to create a symlink /data/latest that points to an existing directory so that the latest data is accessible via its current name and an alias, eg:
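To make the three kinds of link target concrete, here is a minimal sketch that classifies a target string. LinkTargets, Kind and classify are hypothetical names used only for illustration, not part of HDFS; the only real API used is java.net.URI.

```java
import java.net.URI;

// Hypothetical helper (not an HDFS class): classifies a symlink target string.
final class LinkTargets {
    enum Kind { RELATIVE, ABSOLUTE_SAME_FS, FULLY_QUALIFIED }

    static Kind classify(String target) {
        URI uri = URI.create(target);
        if (uri.getScheme() != null) {
            // eg hdfs://localhost:8020/foo, may name any file system
            return Kind.FULLY_QUALIFIED;
        }
        if (uri.getPath().startsWith("/")) {
            // eg /user, resolved against the root of the current file system
            return Kind.ABSOLUTE_SAME_FS;
        }
        // eg ../user, resolved against the directory containing the link
        return Kind.RELATIVE;
    }
}
```

Under these assumptions, classify("../user") yields RELATIVE, classify("/user") yields ABSOLUTE_SAME_FS, and classify("hdfs://localhost:8020/foo") yields FULLY_QUALIFIED.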
$ hadoop fs -ln /data/20090922 /data/latest
The user may eventually want to archive this data so that it's accessible but stored more efficiently. They could create an archive of the files in the 20090922 directory and make the original path a symlink to the HAR, eg:
$ hadoop fs -ln har:///data/20090922.har /data/20090922
They could also move the directory to another file system that is perhaps lightly loaded or less expensive and make the existing directory a symlink, eg:
$ hadoop fs -ln hdfs://archive-host/data/20090922 /data/20090922
The archival file system could also be accessible via an alternative protocol (eg FTP). In both cases the original data has moved but remains accessible by its original path.
This technique can be used generally to balance storage by transparently making a namespace span multiple file systems. For example, if a particular subtree of a namespace outgrows the capabilities of the file system it resides on (eg Namenode performance, number of files, etc) it can be moved to a new file system and linked into its current path.
A symbolic link is also a useful primitive that could be used to implement atomic rename within a file system by atomically rewriting a symbolic link, or to rename a file across partitions (if in the future a file system's metadata is partitioned across multiple hosts). See HADOOP-6240 for more info.
Interaction with the Current System
The user may interact with symbolic links via the shell or indirectly via applications (eg libhdfs clients like fuse mounts).
Symbolic links are transparent to most operations, though some may want to handle links specially. In general, the behavior should match POSIX where appropriate. Note that linking across file systems is somewhat equivalent to creating a symlink across mount points in POSIX.
- Some commands operate on the link directly (eg stat, rm) if the link is the target.
- Some commands (eg mv) should operate on the link target if a trailing slash is used (eg if /bar is a link that points to a directory, mv /foo /bar renames /foo to /bar while mv /foo /bar/ moves /foo into the directory pointed to by /bar).
- Symbolic links in archive URIs should fully resolve.
- Some APIs should operate on the link target (eg setting access and modification times).
Permissions: access control properties of links are ignored; checks are always performed against the link target (if it resides in an HDFS file system). The ch* operations should operate directly on the target.
Some utilities need to be link-aware:
- distcp should not follow links by default.
- fsck should only look at each file once, and could optionally report dangling links.
- Symbolic links in HARs should be followed (so that a symlink to a HAR preserves the original path resolution behavior).
Symbolic links exist independently of their targets, and may point to non-existing files or directories.
Clients send the entire path to the Namenode for resolution. The path may contain multiple links:
- If all links in the path are relative to the current file system then the Namenode transparently (to the client) resolves the path.
- If the Namenode finds a link in the path that points outside the file system it must provide API(s) to (a) report to the client that it cannot resolve the path (due to a link that points outside the file system) and (b) return the target of the link and the remainder of the path that still needs to be resolved.
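The two cases above can be sketched with a toy in-memory link table. Everything here is an assumption made for illustration: ToyNamenode, UnresolvedLinkException and the MAX_LOCAL_LINKS guard are hypothetical names, not the actual Namenode API, and relative targets (eg ../user) are left out for brevity, so local links are absolute paths only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Hypothetical exception: the path crosses a link that leaves this file system.
final class UnresolvedLinkException extends Exception {
    final String linkTarget; // target of the external link, eg hdfs://archive-host/data
    final String remainder;  // path components after the link, possibly empty

    UnresolvedLinkException(String linkTarget, String remainder) {
        super("link to " + linkTarget + " points outside this file system");
        this.linkTarget = linkTarget;
        this.remainder = remainder;
    }
}

// Hypothetical stand-in for Namenode path resolution.
final class ToyNamenode {
    private static final int MAX_LOCAL_LINKS = 32; // assumed guard against local cycles
    private final Map<String, String> links;       // link path -> link target

    ToyNamenode(Map<String, String> links) {
        this.links = links;
    }

    // Returns the link-free path, or throws if an external link is encountered.
    String resolve(String path) throws UnresolvedLinkException {
        Deque<String> pending = new ArrayDeque<>(components(path));
        String resolved = "";
        int followed = 0;
        while (!pending.isEmpty()) {
            String current = resolved + "/" + pending.removeFirst();
            String target = links.get(current);
            if (target == null) {
                resolved = current; // ordinary file or directory component
            } else if (target.contains("://")) {
                // External link: hand the target and the unresolved rest to the client.
                throw new UnresolvedLinkException(target, String.join("/", pending));
            } else {
                if (++followed > MAX_LOCAL_LINKS) {
                    throw new IllegalStateException("too many links in " + path);
                }
                // Local absolute link: restart from the root with the target's components.
                List<String> parts = components(target);
                for (int i = parts.size() - 1; i >= 0; i--) {
                    pending.addFirst(parts.get(i));
                }
                resolved = "";
            }
        }
        return resolved.isEmpty() ? "/" : resolved;
    }

    private static List<String> components(String path) {
        List<String> parts = new ArrayList<>();
        for (String part : path.split("/")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }
}
```

With /data/latest linked to /data/20090922, resolving /data/latest/part-0 returns /data/20090922/part-0 without the client noticing. With /archive linked to hdfs://archive-host/data, resolving /archive/20090922/part-0 throws, carrying the target hdfs://archive-host/data and the remainder 20090922/part-0.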
Symbolic links should be largely invisible to users of the client. However, symbolic links may introduce cycles into path resolution. For example, a link may point to another URI (on another file system) which points back to the link. Loops should be avoided by having the client limit the number of links it will traverse, and report to its user that the operation was not successful.
Symbolic links should not introduce significant overhead in the common case, resolving paths without links. Resolving symbolic links may be a frequent operation. If links are being used to transparently span a namespace across multiple file systems then the "root" Namenode may do little work aside from resolving link paths. Therefore, link resolution should have reasonable performance overhead and limited side-effects on both the client and Namenode.
One approach is to have the client first stat the file to see if it is a symbolic link and resolve the path. Once the path is fully resolved (this may require contacting additional file systems) the client performs the desired operation. This introduces an additional RPC in the common case (when links are not present), so an optimization is to optimistically perform the operation first. If it was unsuccessful due to the presence of an external link the Namenode notifies the client, using an exception, which is caught at the FileSystem level (since the link refers to another file system). The client then makes an additional call to the Namenode to resolve the link.* Once the path is fully resolved the operation can be retried. Note that if the operation contained multiple paths each may contain links which must be fully resolved before the operation can complete. This may require additional round trips. The call to resolve a link may fail if the link has been deleted, or is no longer a link, etc. In this case the resulting exception is passed upwards as if the original operation failed. As stated above, link resolution at the FileSystem level will perform a limited number of link resolutions before notifying the client of the failure.
The Namenode's INodeFile class needs to maintain additional metadata to indicate whether a file is a symbolic link (or an additional class could be introduced).
* NB: It is possible to eliminate this additional RPC by piggybacking the link resolution on the notification.
This design should cover the necessary use cases; however, some of the above features (like enhancing archives) may be deferred.
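A rough sketch of the optimistic approach, with the hop limit, might look as follows. All names here (ToyFileSystem, LinkEncounteredException, LinkFollowingClient, MAX_LINK_HOPS and its value of 8) are hypothetical and exist only for this example. As a simplification, a link is assumed to be the final path component and its target is assumed to be a fully qualified URI.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.URI;
import java.util.Map;

// Hypothetical signal that the operation hit a link to another file system.
final class LinkEncounteredException extends IOException {
    LinkEncounteredException(String path) {
        super("external link at " + path);
    }
}

// Hypothetical in-memory stand-in for one file system (one Namenode).
final class ToyFileSystem {
    private final Map<String, String> files; // path -> contents
    private final Map<String, String> links; // path -> fully qualified target URI

    ToyFileSystem(Map<String, String> files, Map<String, String> links) {
        this.files = files;
        this.links = links;
    }

    // The operation itself, attempted without checking for links first.
    String read(String path) throws IOException {
        if (links.containsKey(path)) {
            throw new LinkEncounteredException(path);
        }
        String contents = files.get(path);
        if (contents == null) {
            throw new FileNotFoundException(path);
        }
        return contents;
    }

    // The additional call; fails if the link was deleted in the meantime.
    String resolveLink(String path) throws IOException {
        String target = links.get(path);
        if (target == null) {
            throw new IOException("not a link: " + path);
        }
        return target;
    }
}

final class LinkFollowingClient {
    static final int MAX_LINK_HOPS = 8; // assumed limit, chosen arbitrarily
    private final Map<String, ToyFileSystem> byAuthority; // eg "archive-host" -> file system

    LinkFollowingClient(Map<String, ToyFileSystem> byAuthority) {
        this.byAuthority = byAuthority;
    }

    String read(String uri) throws IOException {
        URI current = URI.create(uri);
        for (int hops = 0; hops <= MAX_LINK_HOPS; hops++) {
            ToyFileSystem fs = byAuthority.get(current.getAuthority());
            if (fs == null) {
                throw new IOException("unknown file system: " + current);
            }
            try {
                return fs.read(current.getPath()); // optimistic: no stat up front
            } catch (LinkEncounteredException e) {
                // Resolve the link, then retry against the target; a failure
                // here propagates as if the original operation had failed.
                current = URI.create(fs.resolveLink(current.getPath()));
            }
        }
        throw new IOException("too many levels of symbolic links: " + uri);
    }
}
```

When no link is present the loop returns after a single read call, so the common case pays no extra round trip. Two links that point at each other across file systems exhaust the hop limit and surface as an IOException rather than looping forever.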
Related future work is implementing atomic rename within a file system by atomically re-writing a symbolic link.