Design proposal to expose filecache access to Hadoop streaming.
The main differences from the pure-Java filecache code are:
1. As part of job launch (in hadoopStreaming client) we validate presence of
cached archives/files in DFS.
2. As part of Task initialization, symbolic links to cached files/unarchived
directories are created in the Task working directory.
C1. New command-line options (example)
This maps to API calls to static methods:
DistributedCache.addCacheArchive(URI uri, Configuration conf)
DistributedCache.addCacheFile(URI uri, Configuration conf)
This is done in class StreamJob methods parseArgv() and setJobConf().
The code should be similar to the way "-file" is handled.
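A minimal sketch of how parseArgv() might accumulate these options, in the same position-independent style as "-file". A plain Map stands in for the real JobConf, and the conf key names here are placeholders, not the actual keys used by DistributedCache:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a plain Map stands in for org.apache.hadoop.mapred.JobConf,
// and the "stream.cache.*" key names are assumptions for illustration.
public class CacheArgs {
    // Append a DFS URI to a comma-separated conf value, the way
    // DistributedCache.addCacheFile/addCacheArchive accumulate entries.
    static void addCacheEntry(Map<String, String> conf, String key, String uri) {
        String old = conf.get(key);
        conf.put(key, old == null ? uri : old + "," + uri);
    }

    // Position-independent scan of argv, mirroring StreamJob.parseArgv().
    static Map<String, String> parseArgv(String[] argv) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (int i = 0; i < argv.length - 1; i++) {
            if ("-cachefile".equals(argv[i])) {
                addCacheEntry(conf, "stream.cache.files", argv[++i]);
            } else if ("-cachearchive".equals(argv[i])) {
                addCacheEntry(conf, "stream.cache.archives", argv[++i]);
            }
        }
        return conf;
    }
}
```

setJobConf() would then hand the accumulated URIs to the static DistributedCache methods above.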
One difference is that we now require a FileSystem instance to VALIDATE the DFS
paths in -cachefile and -cachearchive. The FileSystem instance must not be
accessed before the filesystem is set by setUserJobConfProps(true).
If the FileSystem instance is "local" and there are -cachearchive/-cachefile
options, then fail: this is not supported.
Else this should return true:
fs_.isFile(Path) for each -cachearchive/-cachefile option.
Only in verbose mode: show the isFile() status of each option.
In any verbosity mode: show the first failed isFile() status and abort with an error.
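The validation loop can be sketched as follows. A Predicate stands in for fs_.isFile(Path) so the sketch is self-contained; in StreamJob this would run only after setUserJobConfProps(true) has established the FileSystem:

```java
import java.net.URI;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the client-side validation step; the Predicate is a stand-in
// for fs_.isFile(Path) on the real DFS FileSystem instance.
public class CacheValidator {
    // Returns null when every cached path exists as a DFS file, otherwise
    // the first URI that failed -- the client should abort on it.
    static String firstMissing(List<String> uris, Predicate<String> isFile,
                               boolean verbose) {
        for (String u : uris) {
            // Strip the #symlink fragment before checking the DFS path.
            String path = URI.create(u).getPath();
            boolean ok = isFile.test(path);
            if (verbose) {
                System.err.println("isFile(" + path + ") = " + ok);
            }
            if (!ok) {
                return u;  // abort with an error naming this URI
            }
        }
        return null;
    }
}
```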
C2. Task initialization
The symlinks are called:
Workingdir/big_1 (points to directory: /cache/user/me/big_zip)
Workingdir/big_2 (points to file: /cache/user/other/big.zip)
Workingdir/bang.zip (points to directory: /cache/user/me/bang_zip)
This will require hadoopStreaming to create symbolic links.
Hadoop should have code to do this in a portable way, although this may not be
supported on non-Unix platforms; cross-platform support is harder than for
hard links. Cygwin soft links are not a solution: they only work for
applications compiled with Cygwin.
Symbolic links also make JUnit tests less portable, so the test should perhaps
run as part of the ant target test-unix (in contrib/streaming/build.xml).
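On platforms that do support them, the Task-side link creation amounts to the following. This is a sketch using java.nio.file (historically Hadoop shelled out to "ln -s" on Unix); the method name is illustrative, not an existing Hadoop API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrates the link layout only; the real Task-side code would need a
// portable symlink mechanism, which is the open question discussed above.
public class CacheLinks {
    // Create workingDir/linkName -> target, replacing any stale link.
    static void link(Path workingDir, String linkName, Path target)
            throws IOException {
        Path link = workingDir.resolve(linkName);
        Files.deleteIfExists(link);
        Files.createSymbolicLink(link, target);
    }
}
```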
The parameters after -cachearchive and -cachefile have the following
properties:
A. you can optionally give a name to your symlink (after #)
B. the default name is the leaf name (big.zip, big.zip, bang.zip)
C. if the same leaf name appears more than once you MUST give a name; otherwise
the Streaming client aborts and complains. In the example above the client
complains because the multiple occurrences of "big.zip" are not disambiguated
with #big_1, #big_2.
Ideally the Streaming client error message should then include an example of
how to fix the parameters.
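A sketch of rule C, deriving each link name and detecting the collision the client must report. Class and method names here are illustrative:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the naming rules: the link name is the #fragment when given,
// otherwise the leaf of the DFS path; duplicate names must be rejected.
public class LinkNames {
    static String linkName(String uriStr) {
        URI uri = URI.create(uriStr);
        if (uri.getFragment() != null) {
            return uri.getFragment();          // rule A: explicit name
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);  // rule B: leaf name
    }

    // Returns the first duplicated link name, or null when all are unique.
    static String firstDuplicate(List<String> uris) {
        Map<String, Integer> seen = new HashMap<>();
        for (String u : uris) {
            String name = linkName(u);
            if (seen.merge(name, 1, Integer::sum) > 1) {
                return name;  // rule C: abort and suggest adding #fragments
            }
        }
        return null;
    }
}
```

The error path is where the client could print a corrected example, e.g. suggesting #big_1 and #big_2 suffixes for the colliding entries.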
Currently argv parsing is position-independent, i.e. changing the order of
arguments never impacts the behaviour of hadoopStreaming. It would be good to
keep this behaviour.
The URI scheme is "dfs:" for consistency with the current state of the Hadoop
code. However, there is a proposal to change the scheme to "hdfs:".
Using a URI fragment to give a local name to the resource is unusual. The main
constraint is that the URI should remain parsable by java.net.URI(String). And
encoding attributes in the fragment is standard (like CGI parameters in an HTTP
GET request). The fragment is #big_2 in dfs:/user/other/big.zip#big_2.
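That constraint is easy to check directly: java.net.URI parses the fragment-suffixed form cleanly into scheme, path, and fragment:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Confirms that a #fragment-suffixed DFS URI stays parsable by
// java.net.URI(String), as required above.
public class FragmentDemo {
    public static void main(String[] args) throws URISyntaxException {
        URI uri = new URI("dfs:/user/other/big.zip#big_2");
        System.out.println(uri.getScheme());    // dfs
        System.out.println(uri.getPath());      // /user/other/big.zip
        System.out.println(uri.getFragment());  // big_2
    }
}
```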