Hadoop Common / HADOOP-341

Enhance distcp to handle *http* as a 'source protocol'.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: util
    • Labels:
      None

      Description

      Requirements:

      Presently distcp recursively copies a directory from one dfs to another, i.e. both source and destination use the dfs protocol.
      Enhance it to handle http as the source protocol, i.e. support copying files from arbitrary http-based sources into the dfs.

      Design:

      Follow distcp's current design: one map task per file which needs to be copied.

      Caveat: distcp handles recursive copying by listing sub-directories. This is less feasible with an http-based source: directory listings (e.g. 'fancy indexing') might not be enabled on the web server, or not for every sub-location, and even where they are enabled, gleaning the sub-directories would mean tedious parsing of the served html. Hence the idea is to support an input file (via a -f option) that lists the http-based urls of the source files.
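
The design above (a list of source urls, one map task per file) can be sketched roughly as follows. This is not code from the attached patches: the class `UrlListSketch`, the method `parseSourceList`, the example.com urls, and the one-url-per-line layout of the list are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class UrlListSketch {

    /**
     * Parses the contents of a "-f"-style list into source URIs.
     * Assumption: the list holds one url per line; blank lines are skipped.
     */
    static List<URI> parseSourceList(String contents) {
        List<URI> sources = new ArrayList<>();
        for (String line : contents.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines between entries
            }
            sources.add(URI.create(trimmed));
        }
        return sources;
    }

    public static void main(String[] args) {
        // Hypothetical list contents; a real run would read them from the -f argument.
        String list = "http://example.com/logs/a.log\n\nhttp://example.com/logs/b.log\n";
        List<URI> sources = parseSourceList(list);
        // One unit of work per source file, mirroring "one map task per file".
        for (int i = 0; i < sources.size(); i++) {
            System.out.println("map task " + i + " <- " + sources.get(i));
        }
    }
}
```

Listing the files explicitly sidesteps the need to discover sub-directories over http, at the cost of requiring the caller to know every source url up front.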

      1. distcp_input_uri.patch
        6 kB
        Arun C Murthy
      2. distcp.patch
        35 kB
        Arun C Murthy
      3. distcp2.patch
        31 kB
        Arun C Murthy

          Activity

          Doug Cutting added a comment -

          I just committed this. Thanks!

          Arun C Murthy added a comment -

          Here's a patch to let distcp take an input uri via the '-f' option. The uri can be of dfs/http/file schemes.
          The tool will then hit that uri to fetch a list of log-files to be copied over.

          thanks,
          Arun

          Arun C Murthy added a comment -

          I have a further enhancement to distcp i.e. -f option now works with urls of scheme http/dfs/file. Hence I'm reopening this issue and will submit another patch shortly.

          Doug, I'll also update logalyzer (HADOOP-342) to reflect these changes and another patch there will be needed too, please hold off commits there.

          thanks,
          Arun

          Doug Cutting added a comment -

          I just committed this. Thanks, Arun.

          Arun C Murthy added a comment -

          Verified patch for the TestCopyFiles junit test.

          thanks,
          Arun

          Doug Cutting added a comment -

          The TestCopyFiles unit test fails after I apply this patch.

          Arun C Murthy added a comment -

          Forgot to add: the above patch (distcp.patch) does a significant refactoring of CopyFiles.java by providing a base CopyFilesMapper class which is subclassed in DFSCopyFilesMapper (which contains Milind's existing code) and HttpCopyFilesMapper (for http-based sources). In future we can add other protocols (ftp?) by creating new subclasses (FtpCopyFilesMapper).

          thanks,
          Arun

          PS: Apologies for the extra spam.
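
The class layout described in the comment above might look roughly like the sketch below. Only the names `CopyFilesMapper`, `DFSCopyFilesMapper` and `HttpCopyFilesMapper` come from the comment; the method `describeCopy`, the helper `mapperFor`, the wrapper class `CopyMapperSketch`, and the idea of selecting a subclass by url scheme are assumptions for illustration, not the contents of distcp.patch.

```java
import java.net.URI;

public class CopyMapperSketch {

    /** Hypothetical stand-in for the base class named in the comment. */
    abstract static class CopyFilesMapper {
        /** Returns a description of the copy; a real mapper would move the bytes. */
        abstract String describeCopy(URI source, String destDir);
    }

    /** Stand-in for the dfs-to-dfs case. */
    static class DFSCopyFilesMapper extends CopyFilesMapper {
        @Override
        String describeCopy(URI source, String destDir) {
            return "dfs copy of " + source.getPath() + " into " + destDir;
        }
    }

    /** Stand-in for http-based sources. */
    static class HttpCopyFilesMapper extends CopyFilesMapper {
        @Override
        String describeCopy(URI source, String destDir) {
            return "http fetch of " + source + " into " + destDir;
        }
    }

    /** Assumed dispatch: choose a mapper from the scheme of the source url. */
    static CopyFilesMapper mapperFor(URI source) {
        String scheme = source.getScheme();
        if ("dfs".equals(scheme)) {
            return new DFSCopyFilesMapper();
        }
        if ("http".equals(scheme)) {
            return new HttpCopyFilesMapper();
        }
        throw new IllegalArgumentException("Unsupported source scheme: " + scheme);
    }

    public static void main(String[] args) {
        URI source = URI.create("http://example.com/logs/a.log"); // hypothetical url
        System.out.println(mapperFor(source).describeCopy(source, "/user/logs"));
    }
}
```

Under this layout, supporting another protocol such as ftp would mean adding one more subclass and one more branch in the dispatch, which matches the extension path the comment suggests.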

          Arun C Murthy added a comment -

          Here's a patch which enables distcp to work with http based source files.

          It also provides a -f option to distcp to provide an input file with urls (arbitrary combinations of dfs & http source paths).


            People

            • Assignee:
              Arun C Murthy
              Reporter:
              Arun C Murthy
            • Votes:
              0
              Watchers:
              1
