Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.10.1
    • Fix Version/s: 0.11.0
    • Component/s: util
    • Labels:
      None

      Description

CopyFiles is a useful tool for doing bulk copies. It has no handling for the recently added s3 filesystem.

      1. copyfiles-s3.diff
        16 kB
        stack
      2. copyfiles-s3-2.diff
        18 kB
        stack
      3. copyfiles-s3-3.diff
        18 kB
        stack
      4. copyfiles-s3-4.diff
        19 kB
        stack

        Activity

        stack created issue -
        stack made changes -
        Field Original Value New Value
        Attachment copyfiles-s3.diff [ 12348425 ]
        stack added a comment -

        Attached is first cut at adding s3 handling to CopyFiles.

        Here's a list of changes:

        + Allow hdfs or dfs URI schemes (used to be dfs only).
        + Changed the usage message so the filesystem is a generic URI (rather than namenode:port | local).
        + Removed getFileSysName; use FileSystem.get with the filesystem URI instead (see the sketch below).
        + getMapCount: Moved the duplicated code for figuring the number of maps here.
        + toURI: Added. The (previously duplicated) URI-validity checks now go through here.
        + CopyFilesReducer: Removed both instances; the class does nothing.
        + Added URI-validity checking of the entries in a file of source URIs.
        + Minor javadoc and formatting changes.

        It's lightly tested.
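
        To make the getFileSysName-to-URI refactor concrete, here is a minimal sketch; the class and method names are illustrative, not the patch's actual code, but FileSystem.get with a URI is the call the change list names:

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;

        public class FsResolveSketch {
          // Resolve a path string to its FileSystem through a URI, so that
          // hdfs://, s3://, and file:// sources all go through the same call
          // and CopyFiles stays filesystem-agnostic.
          static FileSystem resolve(String pathStr, Configuration conf)
              throws Exception {
            URI uri = new URI(pathStr);
            return FileSystem.get(uri, conf);
          }
        }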

        stack made changes -
        Attachment copyfiles-s3-2.diff [ 12348526 ]
        stack added a comment -

        Updated patch.

        + Renamed DFSCopyFilesMapper to FSCopyFilesMapper.
        + If no scheme is given, use the default filesystem (the value of 'fs.default.name' in hadoop-site.xml); see the sketch below.
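
        A minimal sketch of that defaulting behavior, under the assumption of illustrative names (the committed code may differ):

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;

        public class DefaultFsSketch {
          // If the path carries no scheme, fall back to the configured
          // default filesystem (whatever fs.default.name names in
          // hadoop-site.xml); otherwise resolve by the URI's own scheme.
          static FileSystem resolve(String pathStr, Configuration conf)
              throws Exception {
            URI uri = new URI(pathStr);
            return uri.getScheme() == null
                ? FileSystem.get(conf)        // default filesystem
                : FileSystem.get(uri, conf);  // scheme-addressed filesystem
          }
        }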

        I ran more extensive tests going from hdfs to s3 and back again, and copying from http into s3 and hdfs (distcp is a nice tool). For example, here is output from a copy of a small nutch segment from hdfs to s3 (in the listings below, hdfs was set as the fs.default.name filesystem):

        stack@debord:~/checkouts/hadoop$ ./bin/hadoop fs -lsr outputs/segments
        /user/stack/outputs/segments/20070108213341-test <dir>
        /user/stack/outputs/segments/20070108213341-test/crawl_fetch <dir>
        /user/stack/outputs/segments/20070108213341-test/crawl_fetch/part-00000 <dir>
        /user/stack/outputs/segments/20070108213341-test/crawl_fetch/part-00000/data <r 1> 1187
        /user/stack/outputs/segments/20070108213341-test/crawl_fetch/part-00000/index <r 1> 234
        /user/stack/outputs/segments/20070108213341-test/crawl_parse <dir>
        /user/stack/outputs/segments/20070108213341-test/crawl_parse/part-00000 <r 1> 9010
        /user/stack/outputs/segments/20070108213341-test/parse_data <dir>
        /user/stack/outputs/segments/20070108213341-test/parse_data/part-00000 <dir>
        /user/stack/outputs/segments/20070108213341-test/parse_data/part-00000/data <r 1> 4630
        /user/stack/outputs/segments/20070108213341-test/parse_data/part-00000/index <r 1> 234
        /user/stack/outputs/segments/20070108213341-test/parse_text <dir>
        /user/stack/outputs/segments/20070108213341-test/parse_text/part-00000 <dir>
        /user/stack/outputs/segments/20070108213341-test/parse_text/part-00000/data <r 1> 6180
        /user/stack/outputs/segments/20070108213341-test/parse_text/part-00000/index <r 1> 234

        Here's a copy to an s3 directory named segments-bkup:

        % ./bin/hadoop distcp /user/stack/outputs/segments s3://KEY:SECRET@BUCKET/segments-bkup

        Here's a listing of the s3 content:

        stack@debord:~/checkouts/hadoop$ ./bin/hadoop fs -fs s3://KEY:SECRET@BUCKET/segments-bkup -lsr /segments-bkup/
        /segments-bkup/20070108213341-test <dir>
        /segments-bkup/20070108213341-test/crawl_fetch <dir>
        /segments-bkup/20070108213341-test/crawl_fetch/part-00000 <dir>
        /segments-bkup/20070108213341-test/crawl_fetch/part-00000/data <r 1> 1187
        /segments-bkup/20070108213341-test/crawl_fetch/part-00000/index <r 1> 234
        /segments-bkup/20070108213341-test/crawl_parse <dir>
        /segments-bkup/20070108213341-test/crawl_parse/part-00000 <r 1> 9010
        /segments-bkup/20070108213341-test/parse_data <dir>
        /segments-bkup/20070108213341-test/parse_data/part-00000 <dir>
        /segments-bkup/20070108213341-test/parse_data/part-00000/data <r 1> 4630
        /segments-bkup/20070108213341-test/parse_data/part-00000/index <r 1> 234
        /segments-bkup/20070108213341-test/parse_text <dir>
        /segments-bkup/20070108213341-test/parse_text/part-00000 <dir>
        /segments-bkup/20070108213341-test/parse_text/part-00000/data <r 1> 6180
        /segments-bkup/20070108213341-test/parse_text/part-00000/index <r 1> 234

        Tom White added a comment -

        I just tried using this patch, and I managed to copy some local files to the S3 file system without trouble.

        Looking at the code I noticed that the -fs option doesn't seem to be used any longer so it can be dropped. Other than that, it looks fine to me.

        stack added a comment -

        Fixed the usage string (suggested by Tom White's review).

        stack made changes -
        Attachment copyfiles-s3-3.diff [ 12350196 ]
        stack added a comment -

        Marking issue with 'patch available'.

        stack made changes -
        Affects Version/s 0.10.1 [ 12312258 ]
        Affects Version/s 0.10.0 [ 12312207 ]
        Status Open [ 1 ] Patch Available [ 10002 ]
        Fix Version/s 0.11.0 [ 12312257 ]
        stack added a comment -

        Thanks for the review Tom.

        Hadoop QA added a comment -

        -1, because 3 attempts failed to build and test the latest attachment (http://issues.apache.org/jira/secure/attachment/12350196/copyfiles-s3-3.diff) against trunk revision r502402. Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

        stack added a comment -

        New patch to fix the broken unit test. Removes the 'dfs' scheme; only 'hdfs' is allowed from here on out.
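
        A hypothetical sketch of the scheme check this implies (the names are illustrative, not the patch's actual code):

        import java.net.URI;

        public class SchemeCheckSketch {
          // Reject the retired 'dfs' scheme outright; only 'hdfs' names the
          // distributed filesystem from this patch onward.
          static void checkScheme(URI uri) {
            if ("dfs".equals(uri.getScheme())) {
              throw new IllegalArgumentException(
                  "scheme 'dfs' is no longer supported; use 'hdfs': " + uri);
            }
          }
        }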

        stack made changes -
        Attachment copyfiles-s3-4.diff [ 12350237 ]
        stack added a comment -

        Mr 'Hadoop QA', do I have to do anything special to re-trigger your auto-application and test of version 4 of the patch? Thanks.

        Doug Cutting added a comment -

        > Mr 'Hadoop QA' [ ... ]

        Please, call him "Nigel".

        Hadoop QA added a comment -

        +1, because http://issues.apache.org/jira/secure/attachment/12350237/copyfiles-s3-4.diff applied and successfully tested against trunk revision r502694.

        Doug Cutting added a comment -

        I just committed this. Thanks, Michael!

        Doug Cutting made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Doug Cutting made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Transition                      Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open -> Patch Available         27d 4h 12m               1                 stack           02/Feb/07 06:19
        Patch Available -> Resolved     14h 2m                   1                 Doug Cutting    02/Feb/07 20:22
        Resolved -> Closed              7h 1m                    1                 Doug Cutting    03/Feb/07 03:23

          People

          • Assignee: Unassigned
          • Reporter: stack
          • Votes: 0
          • Watchers: 0
