  1. Hadoop Common
  2. HADOOP-341

Enhance distcp to handle *http* as a 'source protocol'.


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: util
    • Labels: None

    Description

      Requirements:

      Presently, distcp recursively copies a directory from one dfs to another, i.e. both the source and the destination use the dfs protocol.
      Enhance it to handle http as the source protocol i.e. support copying files from arbitrary http-based sources into the dfs.

      Design:

      Follow distcp's current design: one map task per file which needs to be copied.

      Caveat: distcp handles recursive copying by listing sub-directories. This is much less feasible with an http-based source: directory listings (e.g. 'fancy-indexing') might not be enabled on the web server for every sub-location, and even where they are enabled, the served html would need tedious parsing to glean the sub-directories. Hence the idea is to support an input file (via a -f option) containing a list of http-based urls that together represent the multiple source files.
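
      The -f idea above can be illustrated with a minimal sketch. This is not the code from the attached patches; it only shows one way a list file could be turned into one copy unit per url. The class name HttpSourceList, the methods parse, targetName and copyOne, and the rule that blank lines and '#' comments are skipped are all assumptions made for illustration. The local filesystem stands in for the dfs destination.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of distcp: reads a -f style list of http urls.
class HttpSourceList {

    // Turns the lines of the list file into source URIs, one per file to copy.
    static List<URI> parse(List<String> lines) {
        List<URI> sources = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            // Assumption: blank lines and '#' comments are ignored.
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            URI uri = URI.create(line); // IllegalArgumentException if malformed
            if (uri.getScheme() == null || !uri.getScheme().equalsIgnoreCase("http")) {
                throw new IllegalArgumentException("Not an http url: " + line);
            }
            sources.add(uri);
        }
        return sources;
    }

    // Uses the last path segment of the url as the destination file name.
    static String targetName(URI source) {
        String path = source.getPath();
        if (path == null || path.isEmpty() || path.endsWith("/")) {
            throw new IllegalArgumentException("Url does not name a file: " + source);
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Copies one source url into destDir; in distcp's design this would be
    // the work of a single map task. A local directory stands in for the dfs.
    static void copyOne(URI source, Path destDir) throws IOException {
        try (InputStream in = source.toURL().openStream()) {
            Files.copy(in, destDir.resolve(targetName(source)),
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: HttpSourceList <url-list-file> <dest-dir>");
            return;
        }
        Path destDir = Paths.get(args[1]);
        for (URI source : parse(Files.readAllLines(Paths.get(args[0])))) {
            copyOne(source, destDir);
        }
    }
}
```

      Because every entry in the list names exactly one file, no html directory listing ever has to be parsed, and the one-task-per-file design carries over unchanged.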

      Attachments

        1. distcp_input_uri.patch
          6 kB
          Arun Murthy
        2. distcp.patch
          35 kB
          Arun Murthy
        3. distcp2.patch
          31 kB
          Arun Murthy


          People

            Assignee: Arun Murthy (acmurthy)
            Reporter: Arun Murthy (acmurthy)
            Votes: 0
            Watchers: 1
