  1. Hadoop Map/Reduce
  2. MAPREDUCE-2149

Distcp : setup with update is too slow when latency is high

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.2, 0.21.0
    • Fix Version/s: None
    • Component/s: distcp
    • Labels:
      None

      Description

      If you run distcp with the '-update' option, setup invokes a separate RPC to the destination cluster for each file present on the source cluster, to fetch that file's info.
      This overhead is usually not very noticeable when both clusters are geographically close to each other. But if the latency is large, setup could take a couple of orders of magnitude longer.

      E.g.: the source has 10k directories, each with about 10 files, and the round-trip latency between source and destination is 75 ms (typical for coast-to-coast clusters).
      If we run distcp on the source cluster, setup would take about 2.5 hours irrespective of whether the destination has these files or not. '-lsr' on the same dest dir from the source cluster would take up to 12 min (depending on how many directories already exist on dest).

      • A fairly simple fix to how setup() iterates should bring the setup time down to the same as '-lsr'. I will have a patch for this (though 12 min is too large).
      • A more scalable option is to defer the update check to the mappers.
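
      The figures above can be roughly reproduced with a back-of-the-envelope calculation. The sketch below is not DistCp code; the class and method names are made up for illustration. It counts only sequential network round trips (one per file for the current setup, one per directory for an '-lsr'-style listing), so it is best read as a lower bound. The reported 2.5 hours is somewhat higher, presumably because of additional per-directory calls and server-side time, which this estimate ignores.

```java
// Hypothetical helper, not part of Hadoop: estimates time spent purely on
// sequential round trips to the destination cluster.
final class SetupLatencyEstimate {

    /** Seconds needed for the given number of sequential round trips. */
    static double seconds(long roundTrips, double rttMillis) {
        return roundTrips * rttMillis / 1000.0;
    }

    public static void main(String[] args) {
        long dirs = 10_000;       // directories on the source (from the description)
        long filesPerDir = 10;    // files per directory (from the description)
        double rttMillis = 75.0;  // round-trip latency (from the description)

        // Current behavior: one RPC per source file.
        double perFile = seconds(dirs * filesPerDir, rttMillis);
        // '-lsr'-like behavior: one listing RPC per directory.
        double perDir = seconds(dirs, rttMillis);

        System.out.printf("per-file RPCs:      %.0f s (%.1f h)%n", perFile, perFile / 3600.0);
        System.out.printf("per-directory RPCs: %.0f s (%.1f min)%n", perDir, perDir / 60.0);
    }
}
```

      With the numbers from the description this gives 7500 s (a little over 2 hours) for per-file calls and 750 s (about 12.5 minutes) for per-directory listings, which is in line with the 2.5 hour and 12 minute figures quoted.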

        Activity

        aw Allen Wittenauer added a comment -

        +1

        We were just talking about this last week, as we move from one data center to another.

        rangadi Raghu Angadi added a comment -

        A patch for the first option is attached.

        Now setup should not take longer than it takes to '-lsr' destination and source directories. This is the best we can do without parallelizing setup().

        The fix is to store the entries of each destination directory in a map and pass it to sameFile().
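
        The idea can be sketched in plain Java as follows. This is not the attached patch and does not use the Hadoop FileSystem API: RemoteDir, Info, and DestListingCache are hypothetical stand-ins, and the sameFile() here compares only name and length, which is a simplification of whatever the real check does. The point it illustrates is that each destination directory is listed at most once and every later lookup is served from the in-memory map.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the fix; names are illustrative, not from DistCp.
final class DestListingCache {

    /** Minimal stand-in for file metadata. */
    static final class Info {
        final String name;
        final long length;
        Info(String name, long length) { this.name = name; this.length = length; }
    }

    /** Stand-in for the destination cluster; each call models one round trip. */
    interface RemoteDir {
        List<Info> list(String dir);
    }

    private final RemoteDir remote;
    // directory path -> (file name -> destination metadata)
    private final Map<String, Map<String, Info>> byDir = new HashMap<>();

    DestListingCache(RemoteDir remote) { this.remote = remote; }

    /** Destination info for dir/name, or null if absent. Lists dir at most once. */
    Info lookup(String dir, String name) {
        Map<String, Info> entries = byDir.computeIfAbsent(dir, d -> {
            Map<String, Info> m = new HashMap<>();
            for (Info i : remote.list(d)) {
                m.put(i.name, i);
            }
            return m;
        });
        return entries.get(name);
    }

    /** Simplified check (assumption): same name exists at dest with same length. */
    boolean sameFile(String dir, Info src) {
        Info dst = lookup(dir, src.name);
        return dst != null && dst.length == src.length;
    }
}
```

        With this shape, the number of round trips to the destination grows with the number of directories rather than the number of files, which matches the '-lsr' bound mentioned above.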

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12458747/MAPREDUCE-2149.patch
        against trunk revision 1074251.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/57//testReport/
        Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/57//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/57//console

        This message is automatically generated.

        mithun Mithun Radhakrishnan added a comment -

        https://issues.apache.org/jira/browse/MAPREDUCE-2765

        This rewrite does attempt to address setup-times (as well as copy performance).

        revans2 Robert Joseph Evans added a comment -

        MAPREDUCE-2765 has gone in and is resolved. If this is still enough of an issue for the 1.0 line, then please upmerge the patch and resubmit it along with test-patch results.

        qwertymaniac Harsh J added a comment -

        I don't appear to have rights to change the resolution status here, but note that the true resolution of this JIRA is not "Fixed" as indicated, but instead "Duplicate" (of MAPREDUCE-2765, which introduces DistCp2 for YARN/MR2).


          People

          • Assignee:
            rangadi Raghu Angadi
            Reporter:
            rangadi Raghu Angadi
          • Votes:
            0
            Watchers:
            5
