Please factor out the directory content parsing into a separate method so that it can be replaced in subclasses.
> factor out the directory content parsing into a separate method so that it can be replaced in subclasses.
Sure, that'd be easy, if we need to subclass. Do we really? Are there required features that cannot be supported by delivering HTML? Will those features be guaranteed not to change from version-to-version, potentially compromising bi-directional compatibility? I think we should implement a servlet that:
1. Considers everything after the HttpServletRequest#getContextPath() as a path.
2. If it names an HDFS file, set attributes as HTTP headers and, if the request is HEAD, return an empty page; if GET, return the content; otherwise return an error.
3. If it's a HEAD or GET of a non-slash-terminated directory, redirect to the slash-terminated directory.
4. If it's a HEAD or GET of a slash-terminated directory name, set attributes and, if GET, return HTML containing links to that directory's files.
5. Otherwise return an error.
Then we should try to use this as a source for MapReduce and distcp and see how it fares. The HTTP client may need to be replaced, file status may need to be cached, etc. But this simple approach will get us up and going, and avoid investing too much time designing a schema, parsing XML, etc. when that may not be required. Thoughts?
A couple of thoughts:
1. If, for performance, we find we must cache FileStatus in most FileSystem#listPaths implementations, then that means the FileSystem API is inappropriate. In this case, we should replace FileSystem#listPaths() and #getFileStatus() with a single new method: public abstract Map<Path,FileStatus> listStatus(Path path) throws IOException;
2. If, in HttpFileSystem, we find that (e.g., in order to efficiently support #listStatus) an HTML-based implementation is insufficient for HDFS, then we should not implement other directory formats by subclassing. Rather, HttpFileSystem should use plugins for various formats. That fits the existing FileSystem extension mechanism better, which dispatches on protocol only. The plugin interface might look like: public interface HttpFileServer {
HttpFileSystem would pick an HttpFileServer implementation based on hostname, content type, or something similar. Content-type would be elegant, but probably insufficient, since, e.g., S3 returns a content-type of application/xml. Hostname would require reconfiguration for each site. Perhaps we can use the "Server" header. That would work for S3, and we could set it for HDFS.
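To make the "Server" header idea concrete, here is a minimal sketch of how a format plugin might be selected. The thread does not show the body of HttpFileServer, so the code uses a placeholder interface, ListingFormat, and a FormatRegistry class; both names are hypothetical, as are the header prefixes a caller would register.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a per-format plugin; not the HttpFileServer interface from the thread. */
interface ListingFormat {
    String name();
}

/** Hypothetical registry that picks a plugin from the HTTP "Server" response header. */
final class FormatRegistry {
    // Lower-cased "Server" header prefix -> plugin, checked in registration order.
    private final Map<String, ListingFormat> byServerPrefix = new LinkedHashMap<>();

    void register(String serverPrefix, ListingFormat format) {
        byServerPrefix.put(serverPrefix.toLowerCase(Locale.ROOT), format);
    }

    /** Returns the first plugin whose prefix matches the header, or empty if none does. */
    Optional<ListingFormat> select(String serverHeader) {
        if (serverHeader == null) {
            return Optional.empty();
        }
        String header = serverHeader.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, ListingFormat> e : byServerPrefix.entrySet()) {
            if (header.startsWith(e.getKey())) {
                return Optional.of(e.getValue());
            }
        }
        return Optional.empty();
    }
}
```

Matching on a prefix rather than the full header value is a design assumption here: servers commonly append version details to the header, and an exact match would need reconfiguration on every upgrade.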
+1
A couple of points regarding the patch: In HttpFileSystem#initialize the name variable is set to itself, so it's always null. By removing getDefaultBlockSize() in S3FileSystem the property "fs.s3.block.size" is removed (but it's still in hadoop-default.xml). This looks like a change that was made earlier in the checksumming work, so is probably fine in the context of this patch. Finally, some unit tests would be good. Otherwise, it looks good.
Should HTML scraping prove inadequate, WebDAV might be useful for this. Its PROPFIND method permits directory enumeration.
This fixes the 'name = name' issue Tom pointed out, and permits file lengths longer than 2^31. I agree that this needs unit tests before it can be committed. I'd also like to first implement a servlet for HDFS to test that performance is acceptable.
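As an illustration of the length issue only (the patch itself is not shown in this thread), a length taken from an HTTP header has to be parsed as a 64-bit value to exceed 2^31. The class and method names below are hypothetical.

```java
/** Hypothetical helper: parses a Content-Length style value as a 64-bit length. */
final class LengthParsing {
    /** Returns the parsed length, or -1 if the value is absent or malformed. */
    static long parseLength(String headerValue) {
        if (headerValue == null) {
            return -1L;
        }
        try {
            // Long.parseLong accepts values beyond Integer.MAX_VALUE (2^31 - 1),
            // where Integer.parseInt would throw NumberFormatException.
            return Long.parseLong(headerValue.trim());
        } catch (NumberFormatException e) {
            return -1L;
        }
    }
}
```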
I don't see an easy way to handle S3 with this, exposing it as a hierarchical space of slash-delimited directories, except perhaps to write a servlet that proxies directory listings and redirects for file content.
The proxy idea sounds good - the servlet pseudo code would be something like:
if path is not slash-terminated
    if HEAD S3 path is successful
        redirect to S3 resource at path
    else
        redirect to path/
else
    GET S3 bucket with prefix = path, delimiter = /
    if bucket is empty
        return 404
    else
        return bucket contents as XHTML
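The pseudocode above can be exercised without a real S3 connection. The following is a rough sketch in which an in-memory set of keys stands in for the bucket; the class name, the string-valued responses, and the "s3:" redirect prefix are all illustrative assumptions, not part of any patch.

```java
import java.util.Collection;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical model of the proxy logic; a key set stands in for the S3 bucket. */
final class S3ProxySketch {
    private final SortedSet<String> keys;

    S3ProxySketch(Collection<String> keys) {
        this.keys = new TreeSet<>(keys);
    }

    /** Returns "302 <target>", "404", or "200 <comma-separated entries>". */
    String handle(String path) {
        if (!path.endsWith("/")) {
            // Stand-in for "HEAD S3 path is successful".
            if (keys.contains(path)) {
                return "302 s3:" + path;
            }
            return "302 " + path + "/";
        }
        // Stand-in for "GET S3 bucket with prefix = path, delimiter = /":
        // collapse everything below the first delimiter into one entry.
        SortedSet<String> entries = new TreeSet<>();
        for (String key : keys) {
            if (!key.startsWith(path)) {
                continue;
            }
            String rest = key.substring(path.length());
            if (rest.isEmpty()) {
                continue;
            }
            int slash = rest.indexOf('/');
            entries.add(slash < 0 ? rest : rest.substring(0, slash + 1));
        }
        if (entries.isEmpty()) {
            return "404";
        }
        // A real servlet would render these entries as XHTML links.
        return "200 " + String.join(",", entries);
    }
}
```

Note that a request for a nonexistent, non-slash-terminated path is first redirected to the slash-terminated form and only then answered with 404, which matches the pseudocode.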
(Of course, the work to do this would go in a new Jira issue.)
I talked with Owen about this, and what he wants is more like a 'tar' format for the FileSystem API, something that preserves standard properties, without being specific to the FileSystem implementation. The URI for this should be something like hftp://host:port/a/b/c, since, while HTTP will be used as the transport, this will not be a FileSystem for arbitrary HTTP urls. Finally, we agreed that the FileSystem API should be altered, so that listStatus() is the primary method, replacing both listPaths() and getStatus(). Whether or not my HttpFileSystem (included above) is in fact ever used, that patch also has some cleanups to the FileSystem API that should be committed.
I moved the FileSystem API cleanups from the patch here to
This assumes that:
This seems to work for directory listings produced by Apache, Tomcat, Jetty, and Subversion. If we make HDFS browsable over HTTP in the above manner, then this will work for HDFS too.
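For readers unfamiliar with this kind of scraping, here is a minimal sketch of extracting child names from an HTML directory listing. It is not the code from the patch; the class name, the regular expression, and the filtering rules (skip query-only sort links, absolute paths, parent links, full URLs, and anything deeper than one level) are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical scraper that pulls direct children out of an HTML directory listing. */
final class ListingScraper {
    // Matches the href value of anchor tags, with either quote style, case-insensitively.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns relative hrefs naming direct children; directories keep a trailing slash. */
    static List<String> children(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1);
            // Skip sort links, absolute paths, parent links, and full URLs.
            if (href.startsWith("?") || href.startsWith("/")
                    || href.startsWith("..") || href.contains("://")) {
                continue;
            }
            // Keep only direct children: a slash may appear only as the last character.
            int slash = href.indexOf('/');
            if (slash >= 0 && slash != href.length() - 1) {
                continue;
            }
            out.add(href);
        }
        return out;
    }
}
```

A regular expression is fragile against unusual markup, so a production client would probably want an HTML parser; the sketch only shows the shape of the approach.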
I've also added default definitions for a bunch of abstract FileSystem methods, and removed definitions in implementations that matched these default definitions, simplifying most FileSystem implementations.