Details
Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Description
Design proposal to expose filecache access to Hadoop streaming.
The main differences from the pure-Java filecache code are:
1. As part of job launch (in the hadoopStreaming client) we validate the
presence of the cached archives/files in DFS.
2. As part of Task initialization, a symbolic link to each cached file or
unarchived directory is created in the Task working directory.
C1. New command-line options (example)
-cachearchive dfs:/user/me/big.zip#big_1
-cachefile dfs:/user/other/big.zip#big_2
-cachearchive dfs:/user/me/bang.zip
This maps to API calls to static methods:
DistributedCache.addCacheArchive(URI uri, Configuration conf)
DistributedCache.addCacheFile(URI uri, Configuration conf)
This is done in the class StreamJob, in the methods parseArgv() and setJobConf().
The code should be similar to the way "-file" is handled.
One difference is that we now require a FileSystem instance to VALIDATE the DFS
paths given to -cachefile and -cachearchive. The FileSystem instance must not be
accessed before the filesystem is set by this call: setUserJobConfProps(true);
If the FileSystem instance is "local" and there are -cachearchive/-cachefile
options, then fail: this combination is not supported.
Otherwise, fs_.isFile(Path) should return true for each -cachearchive/-cachefile
option.
In verbose mode only: show the isFile() status of each option.
At any verbosity level: show the first failed isFile() check and abort via
StreamJob.fail().
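For illustration, here is a minimal sketch of how StreamJob could wire the
options to the DistributedCache calls and run the validation. Only
DistributedCache.addCacheArchive()/addCacheFile(), isFile() and StreamJob.fail()
are named by this proposal; every other name below is hypothetical.

  import java.net.URI;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.LocalFileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.filecache.DistributedCache;

  // Sketch of methods inside StreamJob; verbose_ and fail() are assumed to
  // exist as described above.
  void addCachedResources(List<String> cacheArchives, List<String> cacheFiles,
                          Configuration conf) throws Exception {
    // Only safe after setUserJobConfProps(true) has set the filesystem.
    FileSystem fs = FileSystem.get(conf);
    if (fs instanceof LocalFileSystem
        && !(cacheArchives.isEmpty() && cacheFiles.isEmpty())) {
      fail("-cachearchive/-cachefile are not supported on the local filesystem");
    }
    for (String arg : cacheArchives) {
      validateDfsPath(fs, arg);
      DistributedCache.addCacheArchive(new URI(arg), conf);
    }
    for (String arg : cacheFiles) {
      validateDfsPath(fs, arg);
      DistributedCache.addCacheFile(new URI(arg), conf);
    }
  }

  void validateDfsPath(FileSystem fs, String arg) throws Exception {
    Path p = new Path(new URI(arg).getPath());  // drop any #name fragment
    boolean ok = fs.isFile(p);
    if (verbose_) System.err.println("isFile(" + p + ") = " + ok);
    if (!ok) fail("Not a file in DFS: " + p);   // first failure aborts
  }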
C2. Task initialization
The symlinks are called:
Workingdir/big_1 (points to directory /cache/user/me/big_zip)
Workingdir/big_2 (points to file /cache/user/other/big.zip)
Workingdir/bang.zip (points to directory /cache/user/me/bang_zip)
This will require hadoopStreaming to create symbolic links. Hadoop should have
code to do this in a portable way, although it may not be supported on non-Unix
platforms: cross-platform support is harder than for hard links, and Cygwin
soft links are not a solution (they only work for applications compiled against
cygwin1.dll).
Symbolic links also make JUnit tests less portable, so the test should perhaps
run as part of the ant target test-unix (in contrib/streaming/build.xml).
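As a rough illustration, a Unix-only helper could shell out to ln(1); this is
only a sketch under that assumption (the helper name is made up):

  // Hypothetical Unix-only symlink helper: shells out to "ln -s".
  // Fails on platforms without ln(1).
  static void symLink(String target, String linkName)
      throws java.io.IOException, InterruptedException {
    Process p = Runtime.getRuntime().exec(
        new String[] { "ln", "-s", target, linkName });
    int rc = p.waitFor();
    if (rc != 0) {
      throw new java.io.IOException("ln -s " + target + " " + linkName
          + " failed with exit code " + rc);
    }
  }

Run in the task working directory, symLink("/cache/user/me/big_zip", "big_1")
would produce the Workingdir/big_1 link from the example above.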
The parameters after -cachearchive and -cachefile have the following
properties:
A. You can optionally give a name to your symlink (after #).
B. The default name is the leaf name (big.zip, big.zip, bang.zip).
C. If the same leaf name appears more than once, you MUST give a name;
otherwise the streaming client aborts and complains (a sketch of this check
follows the example below). For instance, the streaming client should complain
about this:
-cachearchive dfs:/user/me/big.zip
-cachefile dfs:/user/other/big.zip
This fails because the multiple occurrences of "big.zip" are not disambiguated
with #big_1 and #big_2.
Ideally, the streaming client error message should then generate an example of
how to fix the parameters:
-cachearchive dfs:/user/me/big.zip#1
-cachefile dfs:/user/other/big.zip#2
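A sketch of the uniqueness check from property C (the helper name and message
wording are made up):

  import java.net.URI;
  import java.util.HashSet;
  import java.util.Set;

  // Every -cachearchive/-cachefile argument must map to a unique link name:
  // the #fragment if given (property A), else the leaf name (property B).
  static void checkLinkNames(Iterable<String> cacheArgs) throws Exception {
    Set<String> seen = new HashSet<String>();
    for (String arg : cacheArgs) {
      URI uri = new URI(arg);
      String name = uri.getFragment();
      if (name == null) {
        String path = uri.getPath();
        name = path.substring(path.lastIndexOf('/') + 1);
      }
      if (!seen.add(name)) {
        throw new IllegalArgumentException("Duplicate link name \"" + name
            + "\": disambiguate with a #name suffix, e.g. " + arg + "#1");
      }
    }
  }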
---------
Hadoop client note:
Currently argv parsing is position-independent, i.e. changing the order of the
arguments never changes the behaviour of hadoopStreaming. It would be good to
keep this property.
URI notes:
scheme is "dfs:" for consistency with current state of Hadoop code.
However there is a proposal to change the scheme to "hdfs:"
Using a URI fragment to give a local name to the resource is unusual. The main
constraint is that the URI should remain parsable by java.net.URI(String). And
encoding attributes in the fragment is standard (like CGI parameters in an HTTP
GET request) (fragment is #big2 in dfs:/user/other/big.zip#big_2)
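For reference, java.net.URI(String) already splits such an argument cleanly:

  java.net.URI uri = new java.net.URI("dfs:/user/other/big.zip#big_2");
  System.out.println(uri.getScheme());   // dfs
  System.out.println(uri.getPath());     // /user/other/big.zip
  System.out.println(uri.getFragment()); // big_2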