Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: 0.21.0
    • Component/s: distcp
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Currently distcp does a checksums check in addition to file length check to decide if a remote file has to be copied. If the number of files is high (thousands), this checksum check is proving to be fairly costly leading to a long time before the copy is started.

      1. mapred-1231-y20-v4.patch
        8 kB
        Jothi Padmanabhan
      2. mapred-1231-v3.patch
        8 kB
        Jothi Padmanabhan
      3. mapred-1231-y20-v3.patch
        8 kB
        Jothi Padmanabhan
      4. mapred-1231-v3.patch
        8 kB
        Jothi Padmanabhan
      5. mapred-1231-v2.patch
        7 kB
        Jothi Padmanabhan
      6. mapred-1231-y20-v2.patch
        7 kB
        Jothi Padmanabhan
      7. mapred-1231-v1.patch
        5 kB
        Jothi Padmanabhan
      8. mapred-1231-y20.patch
        4 kB
        Jothi Padmanabhan
      9. mapred-1231.patch
        2 kB
        Jothi Padmanabhan

        Activity

        Hide
        Jothi Padmanabhan added a comment -

        Need to add that this is relevant only with the -update option

        Show
        Jothi Padmanabhan added a comment - Need to add that this is relevant only with the -update option
        Hide
        Jothi Padmanabhan added a comment -

        Patch that disables checksum validations.

        The long term solution would be to do the setup, including checksum validations, as a separate M/R Job. This possibly could be done in a separate Jira.

        Show
        Jothi Padmanabhan added a comment - Patch that disables checksum validations. The long term solution would be to do the setup, including checksum validations, as a separate M/R Job. This possibly could be done in a separate Jira.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12425817/mapred-1231.patch
        against trunk revision 882790.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/260/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/260/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/260/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/260/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12425817/mapred-1231.patch against trunk revision 882790. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/260/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/260/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/260/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/260/console This message is automatically generated.
        Hide
        Arun C Murthy added a comment -

        Should we make skipping of checksum checks optional rather than completely remove it?

        Show
        Arun C Murthy added a comment - Should we make skipping of checksum checks optional rather than completely remove it?
        Hide
        Venkatesh Seetharam added a comment -

        Arun, this feature was added in H20 and was not available in H18. Hence, this change should be sufficient for us to address the slow start up times.

        Show
        Venkatesh Seetharam added a comment - Arun, this feature was added in H20 and was not available in H18. Hence, this change should be sufficient for us to address the slow start up times.
        Hide
        Jothi Padmanabhan added a comment -

        Patch that makes CRC checks optional by introducing -skipcrccheck option. This is for the Y!20 patch, attaching before the trunk version so that hudson will pick up the correct patch

        Show
        Jothi Padmanabhan added a comment - Patch that makes CRC checks optional by introducing -skipcrccheck option. This is for the Y!20 patch, attaching before the trunk version so that hudson will pick up the correct patch
        Hide
        Jothi Padmanabhan added a comment -

        Patch for trunk that makes the crc checking optional.

        Show
        Jothi Padmanabhan added a comment - Patch for trunk that makes the crc checking optional.
        Hide
        Jothi Padmanabhan added a comment -

        Added a test case to both the patches

        Show
        Jothi Padmanabhan added a comment - Added a test case to both the patches
        Hide
        Aaron Kimball added a comment -

        Arun: I agree that using the checksums should be the default if this is the behavior that users have come to expect.

        rsync does its transfer-list generation based on file length and mtime (and, optionally, a checksum as well with -c). Checking only on the length feels risky to me. I think that it would be much more convincing if sameFile() included an mtime comparison. This shouldn't add significant processing overhead, since the same getStatus() call returns both length and mtime. (For this to work right, though, you'd need to preserve the mtime on the receiver OS, which is done with -pt. So maybe disabling checksum should also recommend the use of -pt in a message to the user?)

        And if mtime comparisons were added, that should also be user-ignorable.

        Show
        Aaron Kimball added a comment - Arun: I agree that using the checksums should be the default if this is the behavior that users have come to expect. rsync does its transfer-list generation based on file length and mtime (and, optionally, a checksum as well with -c ). Checking only on the length feels risky to me. I think that it would be much more convincing if sameFile() included an mtime comparison. This shouldn't add significant processing overhead, since the same getStatus() call returns both length and mtime. (For this to work right, though, you'd need to preserve the mtime on the receiver OS, which is done with -pt . So maybe disabling checksum should also recommend the use of -pt in a message to the user?) And if mtime comparisons were added, that should also be user-ignorable.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        > ... Checking only on the length feels risky to me. ...
        Yes, that why -skipcrccheck is an option. Checking mtime definitely is a good idea but it probably should be done in a separated issue.

        Show
        Tsz Wo Nicholas Sze added a comment - > ... Checking only on the length feels risky to me. ... Yes, that why -skipcrccheck is an option. Checking mtime definitely is a good idea but it probably should be done in a separated issue.
        Hide
        Tsz Wo Nicholas Sze added a comment -

        Patch looks mostly good. Some comments below.

        • Why renaming testCopyDuplication() to atestCopyDuplication()? Typo?
        • If -skipcrccheck is specified without -update, it should print an error message and exit.
        • Could you change ".ignore.crc" in the following to ".skip.crc.check"? It better to keep the wording consistent.
          +    SKIPCRC("-skipcrccheck", NAME + ".ignore.crc");
          
        Show
        Tsz Wo Nicholas Sze added a comment - Patch looks mostly good. Some comments below. Why renaming testCopyDuplication() to atestCopyDuplication()? Typo? If -skipcrccheck is specified without -update, it should print an error message and exit. Could you change ".ignore.crc" in the following to ".skip.crc.check"? It better to keep the wording consistent. + SKIPCRC( "-skipcrccheck" , NAME + ".ignore.crc" );
        Hide
        Arun C Murthy added a comment -

        Cancelling patch while Nic's comments are addressed.

        @Venkatesh - We cannot remove a feature which has already been released in 0.20.0; hence we need to make it optional.

        Show
        Arun C Murthy added a comment - Cancelling patch while Nic's comments are addressed. @Venkatesh - We cannot remove a feature which has already been released in 0.20.0; hence we need to make it optional.
        Hide
        Jothi Padmanabhan added a comment -

        Review comments incorporated

        Show
        Jothi Padmanabhan added a comment - Review comments incorporated
        Hide
        Tsz Wo Nicholas Sze added a comment -

        +1 patch looks good.

        Thanks, Jothi.

        Show
        Tsz Wo Nicholas Sze added a comment - +1 patch looks good. Thanks, Jothi.
        Hide
        Jothi Padmanabhan added a comment -

        Y20 patch with review comments incorporated

        Show
        Jothi Padmanabhan added a comment - Y20 patch with review comments incorporated
        Hide
        Jothi Padmanabhan added a comment -

        Attaching the same trunk patch again so that Hudson picks this up.

        Show
        Jothi Padmanabhan added a comment - Attaching the same trunk patch again so that Hudson picks this up.
        Hide
        Jothi Padmanabhan added a comment -

        Trying to pass through Hudson once more

        Show
        Jothi Padmanabhan added a comment - Trying to pass through Hudson once more
        Hide
        Hemanth Yamijala added a comment -

        I looked at the Yahoo! Hadoop 0.20 patch. One minor nit is that the internal config option name is different between this and the trunk patch. In the trunk patch, the option is distcp.skip.crc.check. In the internal patch it is distcp.skip.crc. Since this is a jobconf option, it may be better to keep these in sync. At the very least, it avoids confusion when Hadoop is upgraded to the trunk version.

        Other than this, the 20 patch looks good.

        Another point, (unrelated to this JIRA), is that the way the post-copy validation is done between trunk and 20 seems different. In trunk, this is done by a call to the API sameFile(). Hence, it includes CRC checks by default. In the internal 20 patch, this check is done only on file lengths irrespective of the option to skip crc checks. It is unclear whether this is by design. At any rate, this inconsistency is not related to this patch.

        Show
        Hemanth Yamijala added a comment - I looked at the Yahoo! Hadoop 0.20 patch. One minor nit is that the internal config option name is different between this and the trunk patch. In the trunk patch, the option is distcp.skip.crc.check. In the internal patch it is distcp.skip.crc. Since this is a jobconf option, it may be better to keep these in sync. At the very least, it avoids confusion when Hadoop is upgraded to the trunk version. Other than this, the 20 patch looks good. Another point, (unrelated to this JIRA), is that the way the post-copy validation is done between trunk and 20 seems different. In trunk, this is done by a call to the API sameFile(). Hence, it includes CRC checks by default. In the internal 20 patch, this check is done only on file lengths irrespective of the option to skip crc checks. It is unclear whether this is by design. At any rate, this inconsistency is not related to this patch.
        Hide
        Jothi Padmanabhan added a comment -

        Thanks for the review, Hemanth. And for the good catch. Here is a patch with the option name changed to .skip.crc.check

        Show
        Jothi Padmanabhan added a comment - Thanks for the review, Hemanth. And for the good catch. Here is a patch with the option name changed to .skip.crc.check
        Hide
        Hemanth Yamijala added a comment -

        +1 for the change.

        Show
        Hemanth Yamijala added a comment - +1 for the change.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12426190/mapred-1231-v3.patch
        against trunk revision 884628.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/272/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/272/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/272/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/272/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12426190/mapred-1231-v3.patch against trunk revision 884628. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/272/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/272/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/272/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/272/console This message is automatically generated.
        Hide
        Jothi Padmanabhan added a comment -

        Test failures are unrelated. See MAPREDUCE-1124 and MAPREDUCE-1245

        Show
        Jothi Padmanabhan added a comment - Test failures are unrelated. See MAPREDUCE-1124 and MAPREDUCE-1245
        Hide
        Tsz Wo Nicholas Sze added a comment -

        I have committed this. Thanks, Jothi!

        Show
        Tsz Wo Nicholas Sze added a comment - I have committed this. Thanks, Jothi!
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #138 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/138/)
        . Added a new DistCp option, -skipcrccheck, so that the CRC check during setup can be skipped. Contributed by Jothi Padmanabhan

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk-Commit #138 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/138/ ) . Added a new DistCp option, -skipcrccheck, so that the CRC check during setup can be skipped. Contributed by Jothi Padmanabhan
        Hide
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #162 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/162/)

        Show
        Hudson added a comment - Integrated in Hadoop-Mapreduce-trunk #162 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/162/ )

          People

          • Assignee:
            Jothi Padmanabhan
            Reporter:
            Jothi Padmanabhan
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development