On Sanjay's create and append:
You are correct: an HDFS proxy deployment does not need to redirect (to a DN); the proxy handles the request itself.
Still, for authentication purposes a probe should be done before attempting to upload data. Because of this, the create & append requests are identical in the hdfs-proxy (Hoop) and built-in (NN & DN HTTP serving) modes. With hdfs-proxy the probe is for authentication only; with built-in it is for both authentication and potential redirection.
This means that we can have the exact same API for both hdfs-proxy and built-in modes.
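To make the "same API for both modes" point concrete, here is a minimal client-side sketch of the probe-then-upload flow. It is an illustration only, not the attached API: the function names, the use of PUT, and the way the probe answers (a Location header in built-in mode, no Location in hdfs-proxy mode) are assumptions of mine.

```python
from http.client import HTTPConnection
from urllib.parse import urlsplit


def http_send(method, url, body=None):
    """Send one request without following redirects; return (status, headers)."""
    parts = urlsplit(url)
    conn = HTTPConnection(parts.netloc)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request(method, path, body=body)
        resp = conn.getresponse()
        resp.read()  # drain the response so the connection closes cleanly
        return resp.status, {k.lower(): v for k, v in resp.getheaders()}
    finally:
        conn.close()


def create_file(url, data, send=http_send):
    """Hypothetical two-step create: probe first, then upload the data."""
    # Step 1: probe with no body, so an auth failure costs no upload.
    status, headers = send("PUT", url, None)
    if status in (401, 403):
        raise PermissionError(f"probe rejected with HTTP {status}")
    # Assumption: built-in mode answers with a Location pointing at a DN,
    # hdfs-proxy mode answers without one and the same URL is reused.
    target = headers.get("location", url)
    # Step 2: send the data to wherever the probe pointed us.
    status, _ = send("PUT", target, data)
    return target, status
```

The client code is the same in both modes; only the probe response decides where the data goes.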
Still, the use of 100-continue is an open issue; more on this at the end of this comment.
On Sanjay's comment on 'some thoughts of webhdfs & hoop':
- Support for trusted proxies (doAs functionality) does make sense in the case of hdfs-proxy, and it is already supported by Hoop. E.g. server-side apps that need/want HTTP access to HDFS and act on behalf of other users, or somebody using the Java API to access HDFS via hdfs-proxy from within a doAs block.
- Support for delegation tokens to access hdfs-proxy does make sense, e.g. when using distcp via hdfs-proxy; in this case, delegation tokens should work across clusters (this may not be supported today but IMO it should eventually work).
- You mention code/param/return clean up. What kind of clean up are you referring to?
On Sanjay's 'As we move forward':
- What subset of the webhdfs API makes sense for a proxy? IMO, they should be identical; a user should not see a difference whether they access a built-in or an hdfs-proxy HTTP setup.
- Regarding a 'pure proxy': this would be more like a reverse proxy, and then all URLs would have to be relative or resolved with knowledge of the reverse proxy. IMO, an hdfs-proxy on its own has its merits.
1* Use of 100-CONTINUE for create & append: it seems not all client HTTP libraries handle this (JDK HttpURLConnection to start). Plus, the servlet API does not provide support for it; it seems some servlet containers handle it, but either in a non-standard way (http://jira.codehaus.org/browse/JETTY-341) or in a way that never reaches the servlet (http://stackoverflow.com/questions/848378/sending-100-continue-using-java-servlet-api). Because of this I'm inclined to use a handle request as shown in the attached API doc.
2* Are we OK with the attached API (except for the discussion on #1)?
3* Codebase: Hoop was using TestNG for testcases and non-Apache package names. I've been working on refactoring it to work with JUnit, to rename the packages and to organize the code in a way that fits the current source layout. In the meantime, for webhdfs (built-in HTTP) some code from Hoop has been cloned, modified and integrated into HDFS. This code has changed significantly, thus integrating it with Hoop will require some serious rewriting of Hoop. Given the timeframe we are shooting for with 0.23, should we add Hoop as a separate module to have hdfs-proxy-like support and later see how to merge the code?
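For reference on item 1, this is roughly what a client has to do by hand when its HTTP library does not handle 100-continue itself. It is a minimal sketch over a raw socket; the function names are hypothetical, and it simply fails on timeout rather than falling back to sending the body anyway.

```python
import socket


def read_status(sock):
    """Read one response head (up to the blank line) and return its status code."""
    buf = b""
    while b"\r\n\r\n" not in buf:
        chunk = sock.recv(1)  # one byte at a time so we never over-read
        if not chunk:
            raise ConnectionError("connection closed before a full response head")
        buf += chunk
    return int(buf.split(b" ", 2)[1].decode("ascii"))


def put_with_expect_continue(sock, host, path, data, timeout=5.0):
    """Send PUT headers with Expect: 100-continue; send the body only if invited."""
    head = (
        f"PUT {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Length: {len(data)}\r\n"
        "Expect: 100-continue\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    sock.settimeout(timeout)
    sock.sendall(head)
    status = read_status(sock)
    if status != 100:
        # The server gave a final answer (e.g. an auth failure or a redirect),
        # so the body is never sent.
        return status
    sock.sendall(data)
    return read_status(sock)
```

The server side needs matching support: something must emit the interim 100 response before the body is read, which is the part the servlet API does not expose.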