The code currently works as follows.
- HftpFileSystem::open(path, bufferSize) issues a GET request to, e.g., http://namenode/data/path
- On the namenode, /data/path is handled by FileDataServlet. FileDataServlet chooses a datanode (using JspHelper.bestNode) and responds with an HTTP redirect to that datanode (e.g., http://datanode/streamFile?filename=path&... )
- The datanode handles /streamFile?filename=path with org.apache.hadoop.hdfs.server.namenode.StreamFile. StreamFile creates a DFSClient and serves the appropriate file.
To handle range requests, the following can be done:
- Modify /streamFile to handle range requests
- Modify the way FileDataServlet chooses a datanode (it should use the block locations in the byte-range being requested, not the block locations for the entire file)
- Add a method to HftpFileSystem that takes one or more byte-range arguments (depending on which of the options below is chosen)
- Confirm that when HttpURLConnection follows redirects, it maintains headers. Specifically, the Range header will need to be sent to the datanode after the redirect response comes back from the namenode.
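As a starting point for the last item, here is a minimal sketch of how the client side could attach a single-range Range header before the request is sent. The class and method names (RangeRequestSketch, rangeHeader, prepare) are made up for illustration and are not part of HftpFileSystem. The sketch only prepares the connection; whether the header is still present on the request to the datanode after the redirect is exactly what remains to be confirmed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper, not existing Hadoop code.
final class RangeRequestSketch {

    /** Formats an inclusive single byte-range, e.g. "bytes=0-499". */
    static String rangeHeader(long begin, long end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("bad range: " + begin + "-" + end);
        }
        return "bytes=" + begin + "-" + end;
    }

    /** Prepares (but does not send) a GET request that carries a Range header. */
    static HttpURLConnection prepare(String url, long begin, long end) {
        try {
            // openConnection() only creates the object; no network I/O happens yet.
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setInstanceFollowRedirects(true);
            conn.setRequestProperty("Range", rangeHeader(begin, end));
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

One way to check the redirect behaviour would be to point such a connection at a test namenode and log the headers received by /streamFile on the datanode.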
The HTTP spec supports multiple byte-ranges in a single request. Such a request returns a multipart (MIME) response of type multipart/byteranges (see: http://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.2 )
This is different from a request that contains a single byte-range, which returns data in the standard format, but with an additional Content-Range header (see: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.16 )
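For the single-range case, the client would need to read the Content-Range header to know which bytes it actually received. A rough sketch of parsing the "bytes first-last/total" form defined in RFC 2616 section 14.16 follows. The class name ContentRangeSketch is hypothetical, and the "bytes */total" form used with 416 responses is deliberately not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not existing Hadoop code.
final class ContentRangeSketch {

    // Matches "bytes <first>-<last>/<total>", where <total> may be "*".
    private static final Pattern CONTENT_RANGE =
        Pattern.compile("bytes (\\d+)-(\\d+)/(\\d+|\\*)");

    /** Returns {first, last, total}; total is -1 when the server sent "*". */
    static long[] parse(String headerValue) {
        Matcher m = CONTENT_RANGE.matcher(headerValue.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("unsupported Content-Range: " + headerValue);
        }
        long first = Long.parseLong(m.group(1));
        long last = Long.parseLong(m.group(2));
        long total = "*".equals(m.group(3)) ? -1L : Long.parseLong(m.group(3));
        if (last < first) {
            throw new IllegalArgumentException("last < first in: " + headerValue);
        }
        return new long[] {first, last, total};
    }
}
```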
There are three options that I see:
a) Support only a single byte-range. This makes more sense to me from an API point of view, since we can keep the existing open() and add a ranged overload next to it:
HftpFileSystem::open(Path f, int buffersize)
HftpFileSystem::open(Path f, int buffersize, long begin, long end)
...which would read the file f from [begin,end].
b) Support multiple byte-ranges. This would require ensuring that HttpURLConnection supports multipart (MIME) responses (I don't know if it does). Supporting this would also lead to a more complicated API, something like:
HftpFileSystem::open(Path f, int buffersize, List<ByteRange> ranges)
Also, because open() returns an FSDataInputStream, supporting multiple byte-ranges would require one of two things: either a read from the FSDataInputStream returns the bytes of the different ranges sequentially (leaving the client to figure out where each range begins and ends in the stream), or open() is changed to return a list of input streams, one per byte-range.
c) We could support multiple byte-ranges in StreamFile, but only support a single byte-range in HftpFileSystem.
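To make the inclusive [begin,end] semantics of option (a) concrete, here is a small sketch of what serving such a range from an arbitrary InputStream amounts to: skip begin bytes, then copy end - begin + 1 bytes or stop at end of file. InclusiveRangeCopy, copyRange and slice are illustrative names only; this is not how StreamFile or DFSClient is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, not existing Hadoop code.
final class InclusiveRangeCopy {

    /** Copies bytes begin..end (both inclusive) from in to out; returns the count copied. */
    static long copyRange(InputStream in, OutputStream out, long begin, long end)
            throws IOException {
        long toSkip = begin;
        while (toSkip > 0) {
            long skipped = in.skip(toSkip);
            if (skipped <= 0) {
                // skip() may legitimately return 0; fall back to reading one byte.
                if (in.read() < 0) {
                    return 0; // end of file reached before the range starts
                }
                skipped = 1;
            }
            toSkip -= skipped;
        }
        long remaining = end - begin + 1;
        long copied = 0;
        byte[] buf = new byte[4096];
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) {
                break; // file is shorter than the requested range
            }
            out.write(buf, 0, n);
            remaining -= n;
            copied += n;
        }
        return copied;
    }

    /** Convenience wrapper for in-memory data, used here only for demonstration. */
    static byte[] slice(byte[] data, long begin, long end) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copyRange(new ByteArrayInputStream(data), out, begin, end);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

With this convention, a request for [2,5] on a ten-byte file yields four bytes, and a range that runs past the end of the file is silently truncated.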
To parse the Range header, I plan to use a few utility classes included in Jetty. Specifically, org.mortbay.jetty.InclusiveByteRange and org.mortbay.util.MultiPartOutputStream (but the latter only if we decide to support multiple byte-ranges). Additionally, the logic used to handle the byte-ranges will be heavily inspired by org.mortbay.jetty.servlet.DefaultServlet::sendData, which is also licensed under the Apache License 2.0.
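For reference, the kind of parsing involved for a single range looks roughly like the sketch below. It is a dependency-free stand-in written for illustration, not Jetty's InclusiveByteRange, and the name RangeHeaderSketch is made up. It handles the three single-range forms ("bytes=a-b", "bytes=a-", "bytes=-n") and, as a simplification, treats a range whose last byte precedes its first as unsatisfiable rather than ignoring it.

```java
// Hypothetical stand-in for illustration; not Jetty's InclusiveByteRange.
final class RangeHeaderSketch {

    /**
     * Parses a single-range Range header against a file of the given length.
     * Returns {first, last} (inclusive, clamped to the file), or null if the
     * range cannot be satisfied.
     */
    static long[] parseSingle(String header, long fileLength) {
        if (header == null || !header.startsWith("bytes=")) {
            throw new IllegalArgumentException("not a bytes range: " + header);
        }
        String spec = header.substring("bytes=".length()).trim();
        if (spec.indexOf(',') >= 0) {
            throw new IllegalArgumentException("multiple ranges not supported: " + header);
        }
        int dash = spec.indexOf('-');
        if (dash < 0) {
            throw new IllegalArgumentException("missing '-': " + header);
        }
        String firstStr = spec.substring(0, dash).trim();
        String lastStr = spec.substring(dash + 1).trim();
        long first;
        long last;
        if (firstStr.isEmpty()) {
            // Suffix form "-n": the last n bytes of the file.
            if (lastStr.isEmpty()) {
                throw new IllegalArgumentException("empty range: " + header);
            }
            long suffix = Long.parseLong(lastStr);
            if (suffix == 0) {
                return null;
            }
            first = Math.max(0, fileLength - suffix);
            last = fileLength - 1;
        } else {
            first = Long.parseLong(firstStr);
            // Open-ended form "a-" runs to the end of the file.
            last = lastStr.isEmpty()
                ? fileLength - 1
                : Math.min(Long.parseLong(lastStr), fileLength - 1);
        }
        if (first > last || first >= fileLength) {
            return null; // unsatisfiable; a server would answer 416 here
        }
        return new long[] {first, last};
    }
}
```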