Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.203.0, 0.23.0
    • Fix Version/s: 0.23.1
    • Component/s: distcp, mrv2
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      DistCpV2 added to hadoop-tools.
    • Tags:
      distcp distcpv2 DynamicInputFormat

      Description

      This is a slightly modified version of the DistCp rewrite that Yahoo uses in production today. The rewrite was ground-up, with specific focus on:
      1. improved startup time (postponing as much work as possible to the MR job)
      2. support for multiple copy-strategies
      3. new features (e.g. -atomic, -async, -bandwidth.)
      4. improved programmatic use
      Some effort has gone into refactoring what used to be achieved by a single large (1.7 KLOC) source file, into a design that (hopefully) reads better too.

      The proposed DistCpV2 preserves command-line-compatibility with the old version, and should be a drop-in replacement.

      New to v2:

      1. Copy-strategies and the DynamicInputFormat:
      A copy-strategy determines the policy by which source-file-paths are distributed between map-tasks. (These boil down to the choice of the input-format.)
      If no strategy is explicitly specified on the command-line, the policy chosen is "uniform size", where v2 behaves identically to old-DistCp. (The number of bytes transferred by each map-task is roughly equal, at a per-file granularity.)
      Alternatively, v2 ships with a "dynamic" copy-strategy (in the DynamicInputFormat). This policy acknowledges that
      (a) dividing files based only on file-size might not be an even distribution (E.g. if some datanodes are slower than others, or if some files are skipped.)
      (b) a "static" association of a source-path to a map increases the likelihood of long-tails during copy.
      The "dynamic" strategy divides the list-of-source-paths into a number (> nMaps) of smaller parts. When each map completes its current list of paths, it picks up a new list to process, if available. So if a map-task is stuck on a slow (and not necessarily large) file, other maps can pick up the slack. The thinner the file-list is sliced, the greater the parallelism (and the lower the chances of long-tails). Within reason, of course: the number of these short-lived list-files is capped at an overridable maximum.
      Internal benchmarks against source/target clusters with some slow(ish) datanodes have indicated significant performance gains when using the dynamic-strategy. Gains are most pronounced when nFiles greatly exceeds nMaps.
      Please note that the DynamicInputFormat might prove useful outside of DistCp. It is hence available as a mapred/lib, unfettered to DistCpV2. Also note that the copy-strategies have no bearing on the CopyMapper.map() implementation.

      2. Improved startup-time and programmatic use:
      When the old-DistCp runs with -update, and creates the list-of-source-paths, it attempts to filter out files that might be skipped (by comparing file-sizes, checksums, etc.) This significantly increases the startup time (or the time spent in serial processing till the MR job is launched), blocking the calling-thread. This becomes pronounced as nFiles increases. (Internal benchmarks have seen situations where more time is spent setting up the job than on the actual transfer.)
      DistCpV2 postpones as much work as possible to the MR job. The file-listing isn't filtered until the map-task runs (at which time, identical files are skipped). DistCpV2 can now be run "asynchronously". The program quits at job-launch, logging the job-id for tracking. Programmatically, the DistCp.execute() returns a Job instance for progress-tracking.

      3. New features:
      (a) -async: As described in #2.
      (b) -atomic: Data is copied to a (user-specifiable) tmp-location, and then moved atomically to destination.
      (c) -bandwidth: Enforces a limit on the bandwidth consumed per map.
      (d) -strategy: As above.

      A more comprehensive description the newer features, how the dynamic-strategy works, etc. is available in src/site/xdoc/, and in the pdf that's generated therefrom, during the build.

      High on the list of things to do is support to parallelize copies on a per-block level. (i.e. Incorporation of HDFS-222.)

      I look forward to comments, suggestions and discussion that will hopefully ensue. I have this running against Hadoop 0.20.203.0. I also have a port to 0.23.0 (complete with unit-tests).

      P.S.
      A tip of the hat to Srikanth (Sundarrajan) and Venkatesh (Seetharamaiah), for ideas, code, reviews and guidance. Although much of the code is mine, the idea to use the DFS to implement "dynamic" input-splits wasn't.

      1. distcpv2.20.203.patch
        419 kB
        Mithun Radhakrishnan
      2. distcpv2_trunk.patch
        402 kB
        Mithun Radhakrishnan
      3. distcpv2_trunk_post_review_1.patch
        403 kB
        Mithun Radhakrishnan
      4. distcpv2_patch_hadoop-trunk_tucu_reviewed.patch
        397 kB
        Mithun Radhakrishnan
      5. distcpv2_patch_0.23.1-SNAPSHOT_tucu_reviewed.patch
        397 kB
        Mithun Radhakrishnan
      6. distcpv2_hadoop-trunk.patch
        396 kB
        Mithun Radhakrishnan
      7. distcpv2_hadoop-0.23.1.patch
        396 kB
        Mithun Radhakrishnan
      8. distcpv2_hadoop-0.23.1.patch
        396 kB
        Mithun Radhakrishnan
      9. 2765_trunk.patch
        397 kB
        Mithun Radhakrishnan
      10. 2765_hadoop-branch-0.23.patch
        397 kB
        Mithun Radhakrishnan

        Issue Links

          Activity

          Mithun Radhakrishnan created issue -
          Mithun Radhakrishnan made changes -
          Field Original Value New Value
          Attachment distcpv2.20.203.patch [ 12488877 ]
          Mithun Radhakrishnan made changes -
          Summary DistCpV2: Rewrite to improve startup time, support multiple copy-strategies (including the "DynamicInputFormat"), improve copy performance and to improve code readability. DistCp Rewrite
          Mithun Radhakrishnan made changes -
          Description This is a slightly modified version of the DistCp rewrite that Yahoo(!) uses in production today. The rewrite was ground-up, with specific focus on:
          1. improved startup time (postponing as much work as possible to the MR job)
          2. support for multiple copy-strategies
          3. new features (e.g. -atomic, -async, -bandwidth.)
          4. improved programmatic use
          Some effort has gone into refactoring what used to be achieved by a single large (1.7 KLOC) source file, into a design that (hopefully) reads better too.

          The proposed DistCpV2 preserves command-line-compatibility with the old version, and should be a drop-in replacement.

          New to v2:

          1. Copy-strategies and the DynamicInputFormat:
          A copy-strategy determines the policy by which source-file-paths are distributed between map-tasks. (These boil down to the choice of the input-format.)
          If no strategy is explicitly specified on the command-line, the policy chosen is "uniform size", where v2 behaves identically to old-DistCp. (The number of bytes transferred by each map-task is roughly equal, at a per-file granularity.)
          Alternatively, v2 ships with a "dynamic" copy-strategy (in the DynamicInputFormat). This policy acknowledges that
          (i) dividing files based only on file-size might not be an even distribution (E.g. if some datanodes are slower than others, or if some files are skipped.)
          (ii) a "static" association of a source-path to a map increases the likelihood of long-tails during copy.
          The "dynamic" strategy divides the list-of-source-paths into a number (> nMaps) of smaller parts. When each map completes its current list of paths, it picks up a new list to process, if available. So if a map-task is stuck on a slow (and not necessarily large) file, other maps can pick up the slack. The thinner the file-list is sliced, the greater the parallelism (and the lower the chances of long-tails). Within reason, of course: the number of these short-lived list-files is capped at an overridable maximum.
          Internal benchmarks against source/target clusters with some slow(ish) datanodes have indicated significant performance gains when using the dynamic-strategy. Gains are most pronounced when nFiles greatly exceeds nMaps.
          Please note that the DynamicInputFormat might prove useful outside of DistCp. It is hence available as a mapred/lib, unfettered to DistCpV2. Also note that the copy-strategies have no bearing on the CopyMapper.map() implementation.

          2. Improved startup-time and programmatic use:
          When the old-DistCp runs with -update, and creates the list-of-source-paths, it attempts to filter out files that might be skipped (by comparing file-sizes, checksums, etc.) This significantly increases the startup time (or the time spent in serial processing till the MR job is launched), blocking the calling-thread. This becomes pronounced as nFiles increases. (Internal benchmarks have seen situations where more time is spent setting up the job than on the actual transfer.)
          DistCpV2 postpones as much work as possible to the MR job. The file-listing isn't filtered until the map-task runs (at which time, identical files are skipped). DistCpV2 can now be run "asynchronously". The program quits at job-launch, logging the job-id for tracking. Programmatically, the DistCp.execute() returns a Job instance for progress-tracking.

          3. New features:
          (i) -async: As described in #2.
          (ii) -atomic: Data is copied to a (user-specifiable) tmp-location, and then moved atomically to destination.
          (iii) -bandwidth: Enforces a limit on the bandwidth consumed per map.
          (iv) -strategy: As above.

          A more comprehensive description the newer features, how the dynamic-strategy works, etc. is available in src/site/xdoc/, and in the pdf that's generated therefrom, during the build.

          I look forward to comments, suggestions and discussion that will hopefully ensue. I have this running against Hadoop 0.20.203.0. I also have a port to 0.23.0 (complete with unit-tests).

          P.S.
          A tip of the hat to Srikanth (Sundarrajan) and Venkatesh (Seetharamaiah), for ideas, code, reviews and guidance. Although much of the code is mine, the idea to use the DFS to implement "dynamic" input-splits wasn't.

          This is a slightly modified version of the DistCp rewrite that Yahoo uses in production today. The rewrite was ground-up, with specific focus on:
          1. improved startup time (postponing as much work as possible to the MR job)
          2. support for multiple copy-strategies
          3. new features (e.g. -atomic, -async, -bandwidth.)
          4. improved programmatic use
          Some effort has gone into refactoring what used to be achieved by a single large (1.7 KLOC) source file, into a design that (hopefully) reads better too.

          The proposed DistCpV2 preserves command-line-compatibility with the old version, and should be a drop-in replacement.

          New to v2:

          1. Copy-strategies and the DynamicInputFormat:
          A copy-strategy determines the policy by which source-file-paths are distributed between map-tasks. (These boil down to the choice of the input-format.)
          If no strategy is explicitly specified on the command-line, the policy chosen is "uniform size", where v2 behaves identically to old-DistCp. (The number of bytes transferred by each map-task is roughly equal, at a per-file granularity.)
          Alternatively, v2 ships with a "dynamic" copy-strategy (in the DynamicInputFormat). This policy acknowledges that
          (a) dividing files based only on file-size might not be an even distribution (E.g. if some datanodes are slower than others, or if some files are skipped.)
          (b) a "static" association of a source-path to a map increases the likelihood of long-tails during copy.
          The "dynamic" strategy divides the list-of-source-paths into a number (> nMaps) of smaller parts. When each map completes its current list of paths, it picks up a new list to process, if available. So if a map-task is stuck on a slow (and not necessarily large) file, other maps can pick up the slack. The thinner the file-list is sliced, the greater the parallelism (and the lower the chances of long-tails). Within reason, of course: the number of these short-lived list-files is capped at an overridable maximum.
          Internal benchmarks against source/target clusters with some slow(ish) datanodes have indicated significant performance gains when using the dynamic-strategy. Gains are most pronounced when nFiles greatly exceeds nMaps.
          Please note that the DynamicInputFormat might prove useful outside of DistCp. It is hence available as a mapred/lib, unfettered to DistCpV2. Also note that the copy-strategies have no bearing on the CopyMapper.map() implementation.

          2. Improved startup-time and programmatic use:
          When the old-DistCp runs with -update, and creates the list-of-source-paths, it attempts to filter out files that might be skipped (by comparing file-sizes, checksums, etc.) This significantly increases the startup time (or the time spent in serial processing till the MR job is launched), blocking the calling-thread. This becomes pronounced as nFiles increases. (Internal benchmarks have seen situations where more time is spent setting up the job than on the actual transfer.)
          DistCpV2 postpones as much work as possible to the MR job. The file-listing isn't filtered until the map-task runs (at which time, identical files are skipped). DistCpV2 can now be run "asynchronously". The program quits at job-launch, logging the job-id for tracking. Programmatically, the DistCp.execute() returns a Job instance for progress-tracking.

          3. New features:
          (a) -async: As described in #2.
          (b) -atomic: Data is copied to a (user-specifiable) tmp-location, and then moved atomically to destination.
          (c) -bandwidth: Enforces a limit on the bandwidth consumed per map.
          (d) -strategy: As above.

          A more comprehensive description the newer features, how the dynamic-strategy works, etc. is available in src/site/xdoc/, and in the pdf that's generated therefrom, during the build.

          High on the list of things to do is support to parallelize copies on a per-block level. (i.e. Incorporation of HDFS-222.)

          I look forward to comments, suggestions and discussion that will hopefully ensue. I have this running against Hadoop 0.20.203.0. I also have a port to 0.23.0 (complete with unit-tests).

          P.S.
          A tip of the hat to Srikanth (Sundarrajan) and Venkatesh (Seetharamaiah), for ideas, code, reviews and guidance. Although much of the code is mine, the idea to use the DFS to implement "dynamic" input-splits wasn't.

          Amareshwari Sriramadasu made changes -
          Assignee Mithun Radhakrishnan [ mithun ]
          Mithun Radhakrishnan made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Mithun Radhakrishnan made changes -
          Attachment distcpv2_trunk.patch [ 12490238 ]
          Mithun Radhakrishnan made changes -
          Attachment distcpv2_trunk_post_review_1.patch [ 12491487 ]
          Harsh J made changes -
          Link This issue blocks HADOOP-6448 [ HADOOP-6448 ]
          Mithun Radhakrishnan made changes -
          Attachment distcpv2_hadoop-0.23.1.patch [ 12505584 ]
          Mithun Radhakrishnan made changes -
          Attachment distcpv2_hadoop-0.23.1.patch [ 12505590 ]
          Attachment distcpv2_hadoop-trunk.patch [ 12505591 ]
          Mahadev konar made changes -
          Priority Major [ 3 ] Critical [ 2 ]
          Component/s mrv2 [ 12314301 ]
          Mithun Radhakrishnan made changes -
          Arun C Murthy made changes -
          Priority Critical [ 2 ] Major [ 3 ]
          Harsh J made changes -
          Link This issue blocks OOZIE-611 [ OOZIE-611 ]
          Mithun Radhakrishnan made changes -
          Status In Progress [ 3 ] Open [ 1 ]
          Mithun Radhakrishnan made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Mahadev konar made changes -
          Fix Version/s 0.23.1 [ 12318883 ]
          Affects Version/s 0.23.0 [ 12315570 ]
          Mithun Radhakrishnan made changes -
          Status In Progress [ 3 ] Open [ 1 ]
          Mithun Radhakrishnan made changes -
          Attachment 2765_trunk.patch [ 12511079 ]
          Attachment 2765_hadoop-branch-0.23.patch [ 12511080 ]
          Mithun Radhakrishnan made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Release Note DistCpV2 added to hadoop-tools.
          Mahadev konar made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags Reviewed [ 10343 ]
          Resolution Fixed [ 1 ]
          Arun C Murthy made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Eli Collins made changes -
          Link This issue relates to HADOOP-8768 [ HADOOP-8768 ]
          Tsz Wo Nicholas Sze made changes -
          Link This issue is related to MAPREDUCE-5081 [ MAPREDUCE-5081 ]

            People

            • Assignee:
              Mithun Radhakrishnan
              Reporter:
              Mithun Radhakrishnan
            • Votes:
              0 Vote for this issue
              Watchers:
              29 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development