Hadoop Common / HADOOP-576

Enhance streaming to use the new caching feature


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed

    Description

      Design proposal to expose filecache access to Hadoop streaming.

      The main differences from the pure-Java filecache code are:
      1. As part of job launch (in hadoopStreaming client) we validate presence of
      cached archives/files in DFS.
      2. As part of Task initialization, a symbolic link to cached files/unarchived
      directories is created in the Task working directory.

      C1. New command-line options (example)
      -cachearchive dfs:/user/me/big.zip#big_1
      -cachefile dfs:/user/other/big.zip#big_2
      -cachearchive dfs:/user/me/bang.zip

      This maps to API calls to static methods:
      DistributedCache.addCacheArchive(URI uri, Configuration conf)
      DistributedCache.addCacheFile(URI uri, Configuration conf)
      This is done in class StreamJob methods parseArgv() and setJobConf().
      The code should be similar to the way "-file" is handled.
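      The option-to-API mapping above could be sketched roughly as follows. This is
      a simplified model, not the actual StreamJob code: the class name, field names,
      and argv loop are assumptions, and the DistributedCache.addCacheArchive /
      addCacheFile calls are shown only as comments so the example stays
      self-contained:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Sketch of how parseArgv() might collect the new options.
// In the real code, setJobConf() would then call
// DistributedCache.addCacheArchive(uri, conf) / addCacheFile(uri, conf)
// for each collected URI.
public class CacheOptionParser {
    final List<URI> cacheArchives = new ArrayList<URI>();
    final List<URI> cacheFiles = new ArrayList<URI>();

    public void parseArgv(String[] argv) throws URISyntaxException {
        for (int i = 0; i < argv.length; i++) {
            if ("-cachearchive".equals(argv[i])) {
                cacheArchives.add(new URI(argv[++i]));
            } else if ("-cachefile".equals(argv[i])) {
                cacheFiles.add(new URI(argv[++i]));
            }
            // other streaming options ("-file", ...) are handled elsewhere
        }
    }
}
```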

      One difference is that we now require a FileSystem instance to VALIDATE the DFS
      paths in -cachefile and -cachearchive. The FileSystem instance should not be
      accessed before the filesystem is set by this: setUserJobConfProps(true);

      If the FileSystem instance is "local" and there are -cachearchive/-cachefile
      options, then fail: this is not supported.

      Otherwise, fs_.isFile(path) should return true for each
      -cachearchive/-cachefile option.
      In verbose mode only: show the isFile() status of each option.
      At any verbosity: report the first failed isFile() check and abort via
      method StreamJob.fail().
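      The validation rules above can be modeled as a pure decision function. This is
      a self-contained sketch only: the method shape is an assumption, and the
      dfsFiles set stands in for fs_.isFile(...) so the example runs without a
      Hadoop cluster:

```java
import java.util.List;
import java.util.Set;

// Sketch of the client-side validation rules:
//  1. local FileSystem + cache options -> fail immediately;
//  2. every cached path must exist as a DFS file, first failure aborts.
public class CacheValidation {
    /** Returns null on success, else the message to pass to StreamJob.fail(). */
    public static String validate(boolean fsIsLocal, List<String> cacheOptions,
                                  Set<String> dfsFiles) {
        if (fsIsLocal && !cacheOptions.isEmpty()) {
            return "-cachearchive/-cachefile are not supported with the local FileSystem";
        }
        for (String opt : cacheOptions) {
            // strip the optional #symlink fragment before checking the path
            String path = opt.contains("#") ? opt.substring(0, opt.indexOf('#')) : opt;
            if (!dfsFiles.contains(path)) {   // stand-in for fs_.isFile(new Path(path))
                return "Cache path is not a DFS file: " + path;   // first failure aborts
            }
        }
        return null;
    }
}
```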

      C2. Task initialization
      The symlinks are called:
      Workingdir/big_1 (points to directory: /cache/user/me/big_zip)
      Workingdir/big_2 (points to file: /cache/user/other/big.zip)
      Workingdir/bang.zip (points to directory /cache/user/me/bang_zip)

      This will require hadoopStreaming to create symbolic links.
      Hadoop should provide code to do this in a portable way, although symbolic
      links may not be supported on non-Unix platforms; cross-platform support is
      harder than for hard links. (Cygwin soft links are not a solution: they only
      work for applications compiled against cygwin1.dll.)
      Symbolic links also make JUnit tests less portable, so the test should
      perhaps run as part of the ant target test-unix (in contrib/streaming/build.xml).
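      In the Java of this era (pre-java.nio.file), symlink creation means exec'ing
      "ln -s", which is exactly why the portability caveats above apply. A minimal
      sketch, assuming a Unix platform (the class and method names are illustrative,
      not the actual Hadoop code):

```java
import java.io.File;
import java.io.IOException;

// Unix-only symlink creation by exec'ing "ln -s". This is why the JUnit
// test would be guarded behind the test-unix ant target: the command does
// not exist on non-Unix platforms, and Cygwin links are not equivalent.
public class Symlinks {
    public static void createSymlink(String target, String link)
            throws IOException, InterruptedException {
        Process p = Runtime.getRuntime().exec(new String[] {"ln", "-s", target, link});
        if (p.waitFor() != 0) {
            throw new IOException("ln -s failed for " + target + " -> " + link);
        }
    }
}
```

      (Java 7 later added java.nio.file.Files.createSymbolicLink, which removes the
      need to shell out on platforms that support links.)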

      The parameters after -cachearchive and -cachefile have the following
      properties:

      A. you can optionally give a name to your symlink (after #)
      B. the default name is the leaf name (big.zip, big.zip, bang.zip)
      C. if the same leaf name appears more than once, you MUST give a name;
      otherwise the streaming client aborts and complains. For example, the
      streaming client should complain about:
      -cachearchive dfs:/user/me/big.zip
      -cachefile dfs:/user/other/big.zip
      It complains because the multiple occurrences of "big.zip" are not
      disambiguated with #big_1, #big_2.
      Ideally the Streaming client error message should then generate an example on
      how to fix the parameters:
      -cachearchive dfs:/user/me/big.zip#1
      -cachefile dfs:/user/other/big.zip#2
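      Rules A-C above amount to a small naming function plus a duplicate check. A
      self-contained sketch (the class and method names are assumptions, not the
      actual streaming code):

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Rules A-C: the symlink name is the #fragment if given (A), otherwise the
// leaf name of the path (B); duplicate names must make the client abort (C).
public class SymlinkNames {
    public static String nameFor(String option) throws URISyntaxException {
        URI uri = new URI(option);
        if (uri.getFragment() != null) {
            return uri.getFragment();                       // rule A: explicit name
        }
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);   // rule B: leaf name
    }

    /** Rule C: returns the first duplicated symlink name, or null if all unique. */
    public static String firstDuplicate(List<String> options) throws URISyntaxException {
        Set<String> seen = new HashSet<String>();
        for (String opt : options) {
            String name = nameFor(opt);
            if (!seen.add(name)) {
                return name;   // client should fail() and suggest #big_1, #big_2
            }
        }
        return null;
    }
}
```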

      ---------

      hadoop-Client note:
      Currently argv parsing is position-independent, i.e. changing the order of
      arguments never impacts the behaviour of hadoopStreaming. It would be good to
      keep this behaviour.

      URI notes:
      scheme is "dfs:" for consistency with current state of Hadoop code.
      However there is a proposal to change the scheme to "hdfs:"

      Using a URI fragment to give a local name to the resource is unusual. The main
      constraint is that the URI should remain parsable by java.net.URI(String). And
      encoding attributes in the fragment is standard (like CGI parameters in an HTTP
      GET request). (The fragment is #big_2 in dfs:/user/other/big.zip#big_2.)
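      The parsability constraint can be checked directly against java.net.URI:

```java
import java.net.URI;
import java.net.URISyntaxException;

// The fragment-naming convention stays parsable by java.net.URI(String):
// scheme "dfs", path "/user/other/big.zip", fragment "big_2".
public class UriFragmentDemo {
    public static void main(String[] args) throws URISyntaxException {
        URI uri = new URI("dfs:/user/other/big.zip#big_2");
        System.out.println(uri.getScheme());    // dfs
        System.out.println(uri.getPath());      // /user/other/big.zip
        System.out.println(uri.getFragment());  // big_2
    }
}
```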

      Attachments

        1. streaming.patch (28 kB, Mahadev Konar)


          People

            Assignee: Mahadev Konar (mahadev)
            Reporter: Michel Tourn (michel_tourn)
            Votes: 0
            Watchers: 0
