Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-13169

Randomize file list in SimpleCopyListing

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • None
    • 2.8.0
    • tools/distcp
    • None
    • Reviewed

    Description

      When copying files to S3, based on file listing some mappers can get into S3 partition hotspots. This would be more visible, when data is copied from hive warehouse with lots of partitions (e.g date partitions). In such cases, some of the tasks would tend to be a lot more slower than others. It would be good to randomize the file paths which are written out in SimpleCopyListing to avoid this issue.

      Attachments

        1. HADOOP-13169-branch-2-010.patch
          15 kB
          Rajesh Balamohan
        2. HADOOP-13169-branch-2-009.patch
          15 kB
          Rajesh Balamohan
        3. HADOOP-13169-branch-2-008.patch
          14 kB
          Rajesh Balamohan
        4. HADOOP-13169-branch-2-007.patch
          13 kB
          Rajesh Balamohan
        5. HADOOP-13169-branch-2-006.patch
          14 kB
          Rajesh Balamohan
        6. HADOOP-13169-branch-2-005.patch
          12 kB
          Rajesh Balamohan
        7. HADOOP-13169-branch-2-004.patch
          8 kB
          Rajesh Balamohan
        8. HADOOP-13169-branch-2-003.patch
          8 kB
          Rajesh Balamohan
        9. HADOOP-13169-branch-2-002.patch
          6 kB
          Rajesh Balamohan
        10. HADOOP-13169-branch-2-001.patch
          7 kB
          Rajesh Balamohan

        Issue Links

          Activity

            TestCopyListing covers this codepath. Encountered HADOOP-13170 when running unit tests; included the trivial fix it.

            rajesh.balamohan Rajesh Balamohan added a comment - TestCopyListing covers this codepath. Encountered HADOOP-13170 when running unit tests; included the trivial fix it.
            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 0s Docker mode activated.
            -1 patch 0m 7s HADOOP-13169 does not apply to branch-2. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



            This message was automatically generated.

            hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 0s Docker mode activated. -1 patch 0m 7s HADOOP-13169 does not apply to branch-2. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. Subsystem Report/Notes JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12804580/HADOOP-13169-branch-2-001.patch JIRA Issue HADOOP-13169 Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/9478/console Powered by Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
            rajesh.balamohan Rajesh Balamohan added a comment - - edited

            Removed TestOptionsParser change which was already available in branch-2 by HDFS-10397 recently.

            rajesh.balamohan Rajesh Balamohan added a comment - - edited Removed TestOptionsParser change which was already available in branch-2 by HDFS-10397 recently.
            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 12s Docker mode activated.
            +1 @author 0m 0s The patch does not contain any @author tags.
            -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
            -1 mvninstall 0m 44s root in branch-2 failed.
            +1 compile 0m 24s branch-2 passed with JDK v1.8.0_91
            +1 compile 0m 15s branch-2 passed with JDK v1.7.0_101
            +1 checkstyle 0m 12s branch-2 passed
            +1 mvnsite 0m 20s branch-2 passed
            +1 mvneclipse 0m 39s branch-2 passed
            +1 findbugs 0m 27s branch-2 passed
            +1 javadoc 0m 9s branch-2 passed with JDK v1.8.0_91
            +1 javadoc 0m 12s branch-2 passed with JDK v1.7.0_101
            -1 mvninstall 0m 18s hadoop-distcp in the patch failed.
            +1 compile 0m 12s the patch passed with JDK v1.8.0_91
            +1 javac 0m 12s the patch passed
            +1 compile 0m 14s the patch passed with JDK v1.7.0_101
            +1 javac 0m 14s the patch passed
            -1 checkstyle 0m 12s hadoop-tools/hadoop-distcp: The patch generated 2 new + 18 unchanged - 0 fixed = 20 total (was 18)
            +1 mvnsite 0m 20s the patch passed
            +1 mvneclipse 0m 10s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 findbugs 0m 37s the patch passed
            +1 javadoc 0m 10s the patch passed with JDK v1.8.0_91
            +1 javadoc 0m 13s the patch passed with JDK v1.7.0_101
            +1 unit 8m 29s hadoop-distcp in the patch passed with JDK v1.8.0_91.
            +1 unit 7m 28s hadoop-distcp in the patch passed with JDK v1.7.0_101.
            +1 asflicense 0m 17s The patch does not generate ASF License warnings.
            23m 8s



            Subsystem Report/Notes
            Docker Image:yetus/hadoop:babe025
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12804597/HADOOP-13169-branch-2-002.patch
            JIRA Issue HADOOP-13169
            Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
            uname Linux f3d0f59903a2 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
            git revision branch-2 / e205421
            Default Java 1.7.0_101
            Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_101
            mvninstall https://builds.apache.org/job/PreCommit-HADOOP-Build/9483/artifact/patchprocess/branch-mvninstall-root.txt
            findbugs v3.0.0
            mvninstall https://builds.apache.org/job/PreCommit-HADOOP-Build/9483/artifact/patchprocess/patch-mvninstall-hadoop-tools_hadoop-distcp.txt
            checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/9483/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-distcp.txt
            JDK v1.7.0_101 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/9483/testReport/
            modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp
            Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/9483/console
            Powered by Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org

            This message was automatically generated.

            hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 12s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 mvninstall 0m 44s root in branch-2 failed. +1 compile 0m 24s branch-2 passed with JDK v1.8.0_91 +1 compile 0m 15s branch-2 passed with JDK v1.7.0_101 +1 checkstyle 0m 12s branch-2 passed +1 mvnsite 0m 20s branch-2 passed +1 mvneclipse 0m 39s branch-2 passed +1 findbugs 0m 27s branch-2 passed +1 javadoc 0m 9s branch-2 passed with JDK v1.8.0_91 +1 javadoc 0m 12s branch-2 passed with JDK v1.7.0_101 -1 mvninstall 0m 18s hadoop-distcp in the patch failed. +1 compile 0m 12s the patch passed with JDK v1.8.0_91 +1 javac 0m 12s the patch passed +1 compile 0m 14s the patch passed with JDK v1.7.0_101 +1 javac 0m 14s the patch passed -1 checkstyle 0m 12s hadoop-tools/hadoop-distcp: The patch generated 2 new + 18 unchanged - 0 fixed = 20 total (was 18) +1 mvnsite 0m 20s the patch passed +1 mvneclipse 0m 10s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 37s the patch passed +1 javadoc 0m 10s the patch passed with JDK v1.8.0_91 +1 javadoc 0m 13s the patch passed with JDK v1.7.0_101 +1 unit 8m 29s hadoop-distcp in the patch passed with JDK v1.8.0_91. +1 unit 7m 28s hadoop-distcp in the patch passed with JDK v1.7.0_101. +1 asflicense 0m 17s The patch does not generate ASF License warnings. 23m 8s Subsystem Report/Notes Docker Image:yetus/hadoop:babe025 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12804597/HADOOP-13169-branch-2-002.patch JIRA Issue HADOOP-13169 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux f3d0f59903a2 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision branch-2 / e205421 Default Java 1.7.0_101 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_101 mvninstall https://builds.apache.org/job/PreCommit-HADOOP-Build/9483/artifact/patchprocess/branch-mvninstall-root.txt findbugs v3.0.0 mvninstall https://builds.apache.org/job/PreCommit-HADOOP-Build/9483/artifact/patchprocess/patch-mvninstall-hadoop-tools_hadoop-distcp.txt checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/9483/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-distcp.txt JDK v1.7.0_101 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/9483/testReport/ modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/9483/console Powered by Apache Yetus 0.3.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 20s Docker mode activated.
            +1 @author 0m 0s The patch does not contain any @author tags.
            -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
            +1 mvninstall 6m 55s branch-2 passed
            +1 compile 0m 17s branch-2 passed with JDK v1.8.0_101
            +1 compile 0m 19s branch-2 passed with JDK v1.7.0_111
            +1 checkstyle 0m 16s branch-2 passed
            +1 mvnsite 0m 24s branch-2 passed
            +1 mvneclipse 0m 21s branch-2 passed
            +1 findbugs 0m 33s branch-2 passed
            +1 javadoc 0m 13s branch-2 passed with JDK v1.8.0_101
            +1 javadoc 0m 15s branch-2 passed with JDK v1.7.0_111
            +1 mvninstall 0m 18s the patch passed
            +1 compile 0m 13s the patch passed with JDK v1.8.0_101
            +1 javac 0m 13s the patch passed
            +1 compile 0m 17s the patch passed with JDK v1.7.0_111
            +1 javac 0m 17s the patch passed
            -0 checkstyle 0m 12s hadoop-tools/hadoop-distcp: The patch generated 4 new + 39 unchanged - 0 fixed = 43 total (was 39)
            +1 mvnsite 0m 22s the patch passed
            +1 mvneclipse 0m 13s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 findbugs 0m 41s the patch passed
            +1 javadoc 0m 10s the patch passed with JDK v1.8.0_101
            +1 javadoc 0m 13s the patch passed with JDK v1.7.0_111
            +1 unit 7m 14s hadoop-distcp in the patch passed with JDK v1.7.0_111.
            +1 asflicense 0m 17s The patch does not generate ASF License warnings.
            29m 45s



            Subsystem Report/Notes
            Docker Image:yetus/hadoop:b59b8b7
            JIRA Issue HADOOP-13169
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12827497/HADOOP-13169-branch-2-003.patch
            Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
            uname Linux 4aa8b1ea64e4 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
            git revision branch-2 / 5f8b6f0
            Default Java 1.7.0_111
            Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111
            findbugs v3.0.0
            checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10461/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-distcp.txt
            JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10461/testReport/
            modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp
            Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10461/console
            Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

            This message was automatically generated.

            hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 20s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 6m 55s branch-2 passed +1 compile 0m 17s branch-2 passed with JDK v1.8.0_101 +1 compile 0m 19s branch-2 passed with JDK v1.7.0_111 +1 checkstyle 0m 16s branch-2 passed +1 mvnsite 0m 24s branch-2 passed +1 mvneclipse 0m 21s branch-2 passed +1 findbugs 0m 33s branch-2 passed +1 javadoc 0m 13s branch-2 passed with JDK v1.8.0_101 +1 javadoc 0m 15s branch-2 passed with JDK v1.7.0_111 +1 mvninstall 0m 18s the patch passed +1 compile 0m 13s the patch passed with JDK v1.8.0_101 +1 javac 0m 13s the patch passed +1 compile 0m 17s the patch passed with JDK v1.7.0_111 +1 javac 0m 17s the patch passed -0 checkstyle 0m 12s hadoop-tools/hadoop-distcp: The patch generated 4 new + 39 unchanged - 0 fixed = 43 total (was 39) +1 mvnsite 0m 22s the patch passed +1 mvneclipse 0m 13s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 41s the patch passed +1 javadoc 0m 10s the patch passed with JDK v1.8.0_101 +1 javadoc 0m 13s the patch passed with JDK v1.7.0_111 +1 unit 7m 14s hadoop-distcp in the patch passed with JDK v1.7.0_111. +1 asflicense 0m 17s The patch does not generate ASF License warnings. 29m 45s Subsystem Report/Notes Docker Image:yetus/hadoop:b59b8b7 JIRA Issue HADOOP-13169 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12827497/HADOOP-13169-branch-2-003.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 4aa8b1ea64e4 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision branch-2 / 5f8b6f0 Default Java 1.7.0_111 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10461/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-distcp.txt JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10461/testReport/ modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10461/console Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
            stevel@apache.org Steve Loughran added a comment -

            this is a good find.

            1. Have you run the hadoop-amazon and hadoop-azure distcp tests with this?
            2. Concurrency. You've marked statusList as a concurrent list, implying there's concurrent use. But in writeToFileListing the list is shuffled,used and then clear()-d in an unsynced block. Is there anything to stop elements being added to the list between that shuffle/use and the clear() operation? as that would lose entries in the distcp
            stevel@apache.org Steve Loughran added a comment - this is a good find. Have you run the hadoop-amazon and hadoop-azure distcp tests with this? Concurrency. You've marked statusList as a concurrent list, implying there's concurrent use. But in writeToFileListing the list is shuffled,used and then clear()-d in an unsynced block. Is there anything to stop elements being added to the list between that shuffle/use and the clear() operation? as that would lose entries in the distcp

            I have run distcp in S3. Results show that there is good improvement with the patch (>100% reduction in overall runtime in the distcp I had tried out with). E.g result is given below.

            tools.SimpleCopyListing: Paths (files+dirs) cnt = 24277; dirCnt = 12031
            
            Without patch:
            ===========
            time hadoop distcp -Ddistcp.simplelisting.file.status.size=50000  -Ddistcp.liststatus.threads=15 s3a://<bucket>/tpcds_bin_partitioned_orc_200.db s3a://<bucket>/distcp/12/
            real	73m32.806s
            user	0m59.452s
            sys	0m3.904s
            
            With patch:
            ===========
            time hadoop distcp -Ddistcp.simplelisting.file.status.size=50000  -Ddistcp.liststatus.threads=15 s3a://<bucket>/tpcds_bin_partitioned_orc_200.db s3a://<bucket>/distcp/15/
            real	33m18.606s
            user	0m53.720s
            sys	0m3.320s
            

            From unit-test perspective, TestS3AContractDistCp passed in S3 which isn't executed as a part of normal test suite. There were errors in TestS3AContractRootDir, but that is unrelated to this change.

            For the concurrency part, statusList would not be changed by workers in writeToFileListing. But added it in synchronized block in the latest patch to ensure that it does not impact for any changes later point in time.

            Also, noticed that the distcp does not have documentation for some of the parameters listed in DistcpConstants. E.g "distcp.work.path, distcp.log.path, distcp.listing.file.path" etc. Will create
            separate ticket to address documentation fixes.

            rajesh.balamohan Rajesh Balamohan added a comment - I have run distcp in S3. Results show that there is good improvement with the patch (>100% reduction in overall runtime in the distcp I had tried out with). E.g result is given below. tools.SimpleCopyListing: Paths (files+dirs) cnt = 24277; dirCnt = 12031 Without patch: =========== time hadoop distcp -Ddistcp.simplelisting.file.status.size=50000 -Ddistcp.liststatus.threads=15 s3a://<bucket>/tpcds_bin_partitioned_orc_200.db s3a://<bucket>/distcp/12/ real 73m32.806s user 0m59.452s sys 0m3.904s With patch: =========== time hadoop distcp -Ddistcp.simplelisting.file.status.size=50000 -Ddistcp.liststatus.threads=15 s3a://<bucket>/tpcds_bin_partitioned_orc_200.db s3a://<bucket>/distcp/15/ real 33m18.606s user 0m53.720s sys 0m3.320s From unit-test perspective, TestS3AContractDistCp passed in S3 which isn't executed as a part of normal test suite. There were errors in TestS3AContractRootDir , but that is unrelated to this change. For the concurrency part, statusList would not be changed by workers in writeToFileListing . But added it in synchronized block in the latest patch to ensure that it does not impact for any changes later point in time. Also, noticed that the distcp does not have documentation for some of the parameters listed in DistcpConstants . E.g " distcp.work.path , distcp.log.path , distcp.listing.file.path " etc. Will create separate ticket to address documentation fixes.
            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 20s Docker mode activated.
            +1 @author 0m 0s The patch does not contain any @author tags.
            -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
            +1 mvninstall 6m 39s branch-2 passed
            +1 compile 0m 16s branch-2 passed with JDK v1.8.0_101
            +1 compile 0m 19s branch-2 passed with JDK v1.7.0_111
            +1 checkstyle 0m 16s branch-2 passed
            +1 mvnsite 0m 23s branch-2 passed
            +1 mvneclipse 0m 14s branch-2 passed
            +1 findbugs 0m 31s branch-2 passed
            +1 javadoc 0m 13s branch-2 passed with JDK v1.8.0_101
            +1 javadoc 0m 15s branch-2 passed with JDK v1.7.0_111
            +1 mvninstall 0m 18s the patch passed
            +1 compile 0m 13s the patch passed with JDK v1.8.0_101
            +1 javac 0m 13s the patch passed
            +1 compile 0m 16s the patch passed with JDK v1.7.0_111
            +1 javac 0m 16s the patch passed
            -0 checkstyle 0m 12s hadoop-tools/hadoop-distcp: The patch generated 4 new + 39 unchanged - 0 fixed = 43 total (was 39)
            +1 mvnsite 0m 21s the patch passed
            +1 mvneclipse 0m 12s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 findbugs 0m 40s the patch passed
            +1 javadoc 0m 9s the patch passed with JDK v1.8.0_101
            +1 javadoc 0m 12s the patch passed with JDK v1.7.0_111
            +1 unit 7m 12s hadoop-distcp in the patch passed with JDK v1.7.0_111.
            +1 asflicense 0m 16s The patch does not generate ASF License warnings.
            28m 51s



            Subsystem Report/Notes
            Docker Image:yetus/hadoop:b59b8b7
            JIRA Issue HADOOP-13169
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12827961/HADOOP-13169-branch-2-004.patch
            Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
            uname Linux b1d0feeb9d40 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
            git revision branch-2 / 6948691
            Default Java 1.7.0_111
            Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111
            findbugs v3.0.0
            checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10478/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-distcp.txt
            JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10478/testReport/
            modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp
            Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10478/console
            Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

            This message was automatically generated.

            hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 20s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 mvninstall 6m 39s branch-2 passed +1 compile 0m 16s branch-2 passed with JDK v1.8.0_101 +1 compile 0m 19s branch-2 passed with JDK v1.7.0_111 +1 checkstyle 0m 16s branch-2 passed +1 mvnsite 0m 23s branch-2 passed +1 mvneclipse 0m 14s branch-2 passed +1 findbugs 0m 31s branch-2 passed +1 javadoc 0m 13s branch-2 passed with JDK v1.8.0_101 +1 javadoc 0m 15s branch-2 passed with JDK v1.7.0_111 +1 mvninstall 0m 18s the patch passed +1 compile 0m 13s the patch passed with JDK v1.8.0_101 +1 javac 0m 13s the patch passed +1 compile 0m 16s the patch passed with JDK v1.7.0_111 +1 javac 0m 16s the patch passed -0 checkstyle 0m 12s hadoop-tools/hadoop-distcp: The patch generated 4 new + 39 unchanged - 0 fixed = 43 total (was 39) +1 mvnsite 0m 21s the patch passed +1 mvneclipse 0m 12s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 40s the patch passed +1 javadoc 0m 9s the patch passed with JDK v1.8.0_101 +1 javadoc 0m 12s the patch passed with JDK v1.7.0_111 +1 unit 7m 12s hadoop-distcp in the patch passed with JDK v1.7.0_111. +1 asflicense 0m 16s The patch does not generate ASF License warnings. 28m 51s Subsystem Report/Notes Docker Image:yetus/hadoop:b59b8b7 JIRA Issue HADOOP-13169 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12827961/HADOOP-13169-branch-2-004.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux b1d0feeb9d40 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision branch-2 / 6948691 Default Java 1.7.0_111 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10478/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-distcp.txt JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10478/testReport/ modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10478/console Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.

            Sharing the HDFS based numbers based on the patch:

            Source data size
            448782412981  /apps/hive/warehouse/tpcds_bin_partitioned_orc.db
            
            Without Patch:
            ==========
            real	21m46.126s
            user	0m42.566s
            sys	0m3.282s
            
            With Patch:
            ==========
            real	12m23.091s
            user	0m38.096s
            sys	0m2.686s
            

            This was on 20 node cluster, which shows good improvement with HDFS as well. With randomization, CopyMapper could get better locality when copying over data in HDFS. Without the patch, CopyMapper could end up reading data remotely for file copying (e.g file paths from the listing "web_returns/wr_returned_date_sk=2452626/000135_0, web_returns/wr_returned_date_sk=2452624/000121_0 ...").

            Recent patch also has the option to turn off this feature on optional basis distcp.simplelisting.randomize.files=false, distcp.simplelisting.file.status.size=1000. Also included a test case
            in TestCopyListing

            rajesh.balamohan Rajesh Balamohan added a comment - Sharing the HDFS based numbers based on the patch: Source data size 448782412981 /apps/hive/warehouse/tpcds_bin_partitioned_orc.db Without Patch: ========== real 21m46.126s user 0m42.566s sys 0m3.282s With Patch: ========== real 12m23.091s user 0m38.096s sys 0m2.686s This was on 20 node cluster, which shows good improvement with HDFS as well. With randomization, CopyMapper could get better locality when copying over data in HDFS. Without the patch, CopyMapper could end up reading data remotely for file copying (e.g file paths from the listing "web_returns/wr_returned_date_sk=2452626/000135_0, web_returns/wr_returned_date_sk=2452624/000121_0 ..."). Recent patch also has the option to turn off this feature on optional basis distcp.simplelisting.randomize.files=false , distcp.simplelisting.file.status.size=1000 . Also included a test case in TestCopyListing
            hadoopqa Hadoop QA added a comment -
            +1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 21s Docker mode activated.
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
            +1 mvninstall 6m 33s branch-2 passed
            +1 compile 0m 15s branch-2 passed with JDK v1.8.0_101
            +1 compile 0m 18s branch-2 passed with JDK v1.7.0_111
            +1 checkstyle 0m 15s branch-2 passed
            +1 mvnsite 0m 23s branch-2 passed
            +1 mvneclipse 0m 19s branch-2 passed
            +1 findbugs 0m 31s branch-2 passed
            +1 javadoc 0m 12s branch-2 passed with JDK v1.8.0_101
            +1 javadoc 0m 15s branch-2 passed with JDK v1.7.0_111
            +1 mvninstall 0m 18s the patch passed
            +1 compile 0m 13s the patch passed with JDK v1.8.0_101
            +1 javac 0m 13s the patch passed
            +1 compile 0m 16s the patch passed with JDK v1.7.0_111
            +1 javac 0m 16s the patch passed
            -0 checkstyle 0m 13s hadoop-tools/hadoop-distcp: The patch generated 7 new + 51 unchanged - 1 fixed = 58 total (was 52)
            +1 mvnsite 0m 21s the patch passed
            +1 mvneclipse 0m 12s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 findbugs 0m 39s the patch passed
            +1 javadoc 0m 9s the patch passed with JDK v1.8.0_101
            +1 javadoc 0m 13s the patch passed with JDK v1.7.0_111
            +1 unit 7m 13s hadoop-distcp in the patch passed with JDK v1.7.0_111.
            +1 asflicense 0m 17s The patch does not generate ASF License warnings.
            29m 5s



            Subsystem Report/Notes
            Docker Image:yetus/hadoop:b59b8b7
            JIRA Issue HADOOP-13169
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828163/HADOOP-13169-branch-2-005.patch
            Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
            uname Linux 51795f4053e4 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
            git revision branch-2 / 6948691
            Default Java 1.7.0_111
            Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111
            findbugs v3.0.0
            checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10489/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-distcp.txt
            JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10489/testReport/
            modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp
            Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10489/console
            Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

            This message was automatically generated.

            hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 21s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 6m 33s branch-2 passed +1 compile 0m 15s branch-2 passed with JDK v1.8.0_101 +1 compile 0m 18s branch-2 passed with JDK v1.7.0_111 +1 checkstyle 0m 15s branch-2 passed +1 mvnsite 0m 23s branch-2 passed +1 mvneclipse 0m 19s branch-2 passed +1 findbugs 0m 31s branch-2 passed +1 javadoc 0m 12s branch-2 passed with JDK v1.8.0_101 +1 javadoc 0m 15s branch-2 passed with JDK v1.7.0_111 +1 mvninstall 0m 18s the patch passed +1 compile 0m 13s the patch passed with JDK v1.8.0_101 +1 javac 0m 13s the patch passed +1 compile 0m 16s the patch passed with JDK v1.7.0_111 +1 javac 0m 16s the patch passed -0 checkstyle 0m 13s hadoop-tools/hadoop-distcp: The patch generated 7 new + 51 unchanged - 1 fixed = 58 total (was 52) +1 mvnsite 0m 21s the patch passed +1 mvneclipse 0m 12s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 39s the patch passed +1 javadoc 0m 9s the patch passed with JDK v1.8.0_101 +1 javadoc 0m 13s the patch passed with JDK v1.7.0_111 +1 unit 7m 13s hadoop-distcp in the patch passed with JDK v1.7.0_111. +1 asflicense 0m 17s The patch does not generate ASF License warnings. 29m 5s Subsystem Report/Notes Docker Image:yetus/hadoop:b59b8b7 JIRA Issue HADOOP-13169 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828163/HADOOP-13169-branch-2-005.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 51795f4053e4 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision branch-2 / 6948691 Default Java 1.7.0_111 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/10489/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-distcp.txt JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10489/testReport/ modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10489/console Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
            cnauroth Chris Nauroth added a comment -

            rajesh.balamohan, thank you for this patch, and thank you also for the performance testing on both S3A and HDFS. I have a few comments.

              private int fileStatusLimit = 1000;
              private boolean randomizeFileListing;
            

            Please make both of these final, because they only need to be assigned inside the constructor.

                LOG.info("numListstatusThreads=" + numListstatusThreads
                    + ", fileStatusLimit=" + fileStatusLimit
                    + ", randomizeFileListing=" + randomizeFileListing);
            
                  LOG.info("Number of paths written to fileListing=" + fileStatusInfoList.size());
            

            Would you please switch these to debug level? These messages are more like the debug level logging that this class already does.

                  List<FileStatusInfo> fileStatuses =
                      Collections.synchronizedList(new ArrayList<FileStatusInfo>());
            
                  List<FileStatusInfo> statusList =
                      Collections.synchronizedList(new ArrayList<FileStatusInfo>());
            

            Is the locking on these lists necessary? If I am reading this code correctly, concurrency happens through the ProducerConsumer executing FileStatusProcessor in parallel, but I don't see the FileStatusProcessor threads accessing these lists. Instead, it appears that completed work is pulled back out of the ProducerConsumer and added to the lists on a single thread. (If I am misreading this code, please let me know.)

                  List<Path> srcPaths = new ArrayList<Path>();
            

            Lines like this can use the Java 7 diamond operator, so new ArrayList<>(). This patch is targeted to 2.8.0, and we require at least Java 7 on that branch, so using the diamond operator is fine.

                    Path p = new Path("/tmp/", String.valueOf(i));
            

            Minor nit: please remove the trailing slash from "/tmp/", as the Path constructor will normalize and strip trailing slashes anyway.

                    OutputStream out = fs.create(fileName);
                    out.write(i);
                    out.close();
            

            Please use try-finally or try-with-resources to guarantee close of each stream. It's unlikely that this will ever be a problem in practice, but some platforms like Windows are very picky about leaked file handles.

                    Assert.fail("Should have failed as file listing should be randomized");
            

            Is there a random chance that this assertion could fail if the call to Collections#shuffle results in the same order that was input?

                } catch (IOException e) {
                  LOG.error("Exception encountered ", e);
                  Assert.fail("Test build listing failed");
            

            I suggest not catching the exception and simply letting it be thrown out of the test method. If the test fails due to an exception, this would make the problem more visible in the JUnit report on Jenkins.

            I know the other tests in this suite are coded to catch the exception. No need to go back and clean up all of those as part of this patch. (I'd still +1 it though if you feel like doing that clean-up.)

                    Assert.assertEquals(fs.makeQualified(srcFiles.get(idx)), currentVal.getPath());
            

            For this assertion, I suggest passing a descriptive message including the value of idx as the first argument. That way, if we ever see a failure, we'll have more information about which iteration of the loop failed.

            The Checkstyle warnings are for exceeding the maximum line length. Please fix those in the next patch revision.

            cnauroth Chris Nauroth added a comment - rajesh.balamohan , thank you for this patch, and thank you also for the performance testing on both S3A and HDFS. I have a few comments. private int fileStatusLimit = 1000; private boolean randomizeFileListing; Please make both of these final , because they only need to be assigned inside the constructor. LOG.info( "numListstatusThreads=" + numListstatusThreads + ", fileStatusLimit=" + fileStatusLimit + ", randomizeFileListing=" + randomizeFileListing); LOG.info( " Number of paths written to fileListing=" + fileStatusInfoList.size()); Would you please switch these to debug level? These messages are more like the debug level logging that this class already does. List<FileStatusInfo> fileStatuses = Collections.synchronizedList( new ArrayList<FileStatusInfo>()); List<FileStatusInfo> statusList = Collections.synchronizedList( new ArrayList<FileStatusInfo>()); Is the locking on these lists necessary? If I am reading this code correctly, concurrency happens through the ProducerConsumer executing FileStatusProcessor in parallel, but I don't see the FileStatusProcessor threads accessing these lists. Instead, it appears that completed work is pulled back out of the ProducerConsumer and added to the lists on a single thread. (If I am misreading this code, please let me know.) List<Path> srcPaths = new ArrayList<Path>(); Lines like this can use the Java 7 diamond operator, so new ArrayList<>() . This patch is targeted to 2.8.0, and we require at least Java 7 on that branch, so using the diamond operator is fine. Path p = new Path( "/tmp/" , String .valueOf(i)); Minor nit: please remove the trailing slash from "/tmp/", as the Path constructor will normalize and strip trailing slashes anyway. OutputStream out = fs.create(fileName); out.write(i); out.close(); Please use try-finally or try-with-resources to guarantee close of each stream. It's unlikely that this will ever be a problem in practice, but some platforms like Windows are very picky about leaked file handles. Assert.fail( "Should have failed as file listing should be randomized" ); Is there a random chance that this assertion could fail if the call to Collections#shuffle results in the same order that was input? } catch (IOException e) { LOG.error( "Exception encountered " , e); Assert.fail( "Test build listing failed" ); I suggest not catching the exception and simply letting it be thrown out of the test method. If the test fails due to an exception, this would make the problem more visible in the JUnit report on Jenkins. I know the other tests in this suite are coded to catch the exception. No need to go back and clean up all of those as part of this patch. (I'd still +1 it though if you feel like doing that clean-up.) Assert.assertEquals(fs.makeQualified(srcFiles.get(idx)), currentVal.getPath()); For this assertion, I suggest passing a descriptive message including the value of idx as the first argument. That way, if we ever see a failure, we'll have more information about which iteration of the loop failed. The Checkstyle warnings are for exceeding the maximum line length. Please fix those in the next patch revision.

            Thank you very much for the review cnauroth.

            Changes:
            1. Made fileStatusLimit, randomizeFileListing as final fields.
            2. Fixed logging to debug level in SimpleCopyListing related change.
            3. You are correct about the synchronizedList related change. It is not accessed in multi-threaded mode. Marked it as LinkedList.
            4. Added diamond operator instead of new ArrayList<Path>
            5. Added "try-with-resources" in test case
            6. Removed the IOException in test case and let it throw the exception.
            7. Fixed "/tmp/" in Path
            8. Added better error message by including idx in test case
            9. For "Collection.shuffle()", it ends up shuffling 10 items (1,2,..10). If its smaller list, there are higher chances of getting the same result. With more items (increased to 100 now), it might not be the case. Please correct me if I am wrong.
            10. Fixed the checkstyle issues.

            rajesh.balamohan Rajesh Balamohan added a comment - Thank you very much for the review cnauroth . Changes: 1. Made fileStatusLimit, randomizeFileListing as final fields. 2. Fixed logging to debug level in SimpleCopyListing related change. 3. You are correct about the synchronizedList related change. It is not accessed in multi-threaded mode. Marked it as LinkedList. 4. Added diamond operator instead of new ArrayList<Path> 5. Added "try-with-resources" in test case 6. Removed the IOException in test case and let it throw the exception. 7. Fixed "/tmp/" in Path 8. Added better error message by including idx in test case 9. For "Collection.shuffle()", it ends up shuffling 10 items (1,2,..10). If its smaller list, there are higher chances of getting the same result. With more items (increased to 100 now), it might not be the case. Please correct me if I am wrong. 10. Fixed the checkstyle issues.
            hadoopqa Hadoop QA added a comment -
            -1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 18s Docker mode activated.
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
            +1 mvninstall 6m 29s branch-2 passed
            +1 compile 0m 15s branch-2 passed with JDK v1.8.0_101
            +1 compile 0m 19s branch-2 passed with JDK v1.7.0_111
            +1 checkstyle 0m 15s branch-2 passed
            +1 mvnsite 0m 24s branch-2 passed
            +1 mvneclipse 0m 15s branch-2 passed
            +1 findbugs 0m 31s branch-2 passed
            +1 javadoc 0m 12s branch-2 passed with JDK v1.8.0_101
            +1 javadoc 0m 16s branch-2 passed with JDK v1.7.0_111
            +1 mvninstall 0m 18s the patch passed
            +1 compile 0m 12s the patch passed with JDK v1.8.0_101
            +1 javac 0m 12s the patch passed
            +1 compile 0m 16s the patch passed with JDK v1.7.0_111
            +1 javac 0m 16s the patch passed
            +1 checkstyle 0m 12s hadoop-tools/hadoop-distcp: The patch generated 0 new + 51 unchanged - 1 fixed = 51 total (was 52)
            +1 mvnsite 0m 21s the patch passed
            +1 mvneclipse 0m 12s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 findbugs 0m 39s the patch passed
            +1 javadoc 0m 9s the patch passed with JDK v1.8.0_101
            +1 javadoc 0m 12s the patch passed with JDK v1.7.0_111
            -1 unit 9m 6s hadoop-distcp in the patch failed with JDK v1.7.0_111.
            +1 asflicense 0m 17s The patch does not generate ASF License warnings.
            32m 18s



            Reason Tests
            JDK v1.8.0_101 Failed junit tests hadoop.tools.TestCopyListing
            JDK v1.7.0_111 Failed junit tests hadoop.tools.TestCopyListing



            Subsystem Report/Notes
            Docker Image:yetus/hadoop:b59b8b7
            JIRA Issue HADOOP-13169
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828422/HADOOP-13169-branch-2-006.patch
            Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
            uname Linux 296911ae3b7b 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
            git revision branch-2 / 3f36ac9
            Default Java 1.7.0_111
            Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111
            findbugs v3.0.0
            unit https://builds.apache.org/job/PreCommit-HADOOP-Build/10506/artifact/patchprocess/patch-unit-hadoop-tools_hadoop-distcp-jdk1.7.0_111.txt
            JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10506/testReport/
            modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp
            Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10506/console
            Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

            This message was automatically generated.

            hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 18s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 6m 29s branch-2 passed +1 compile 0m 15s branch-2 passed with JDK v1.8.0_101 +1 compile 0m 19s branch-2 passed with JDK v1.7.0_111 +1 checkstyle 0m 15s branch-2 passed +1 mvnsite 0m 24s branch-2 passed +1 mvneclipse 0m 15s branch-2 passed +1 findbugs 0m 31s branch-2 passed +1 javadoc 0m 12s branch-2 passed with JDK v1.8.0_101 +1 javadoc 0m 16s branch-2 passed with JDK v1.7.0_111 +1 mvninstall 0m 18s the patch passed +1 compile 0m 12s the patch passed with JDK v1.8.0_101 +1 javac 0m 12s the patch passed +1 compile 0m 16s the patch passed with JDK v1.7.0_111 +1 javac 0m 16s the patch passed +1 checkstyle 0m 12s hadoop-tools/hadoop-distcp: The patch generated 0 new + 51 unchanged - 1 fixed = 51 total (was 52) +1 mvnsite 0m 21s the patch passed +1 mvneclipse 0m 12s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 39s the patch passed +1 javadoc 0m 9s the patch passed with JDK v1.8.0_101 +1 javadoc 0m 12s the patch passed with JDK v1.7.0_111 -1 unit 9m 6s hadoop-distcp in the patch failed with JDK v1.7.0_111. +1 asflicense 0m 17s The patch does not generate ASF License warnings. 32m 18s Reason Tests JDK v1.8.0_101 Failed junit tests hadoop.tools.TestCopyListing JDK v1.7.0_111 Failed junit tests hadoop.tools.TestCopyListing Subsystem Report/Notes Docker Image:yetus/hadoop:b59b8b7 JIRA Issue HADOOP-13169 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828422/HADOOP-13169-branch-2-006.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 296911ae3b7b 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision branch-2 / 3f36ac9 Default Java 1.7.0_111 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111 findbugs v3.0.0 unit https://builds.apache.org/job/PreCommit-HADOOP-Build/10506/artifact/patchprocess/patch-unit-hadoop-tools_hadoop-distcp-jdk1.7.0_111.txt JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10506/testReport/ modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10506/console Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
            stevel@apache.org Steve Loughran added a comment -

            I'll let Chris do the final review.

            Now, some bad news about logging @ debug. For commons logging APIs, that needs to be wrapped by if (LOG.isDebugEnabled() clauses; this skips the expense of building strings which are then never used.

            For classes which use the SLF4J logging APIs, you can get away with uising LOG.debug(), provided the style is

            LOG.debug("Adding {}", fileStatusInfo.fileStatus);
            

            Here the string concat only happens if the log is @ debug level, so less expensive and no longer needs to be wrapped. That's why we get away with this in the s3a classes, which have all been upgraded.

            This leaves you with a choice: wrap the debug statements or move the LOG up to SLF4J, the latter simply by changing the class of the log and its factory, adding the new imports and deleting the old one

            
            

            Logger LOG = LoggerFactory.getLogger(...)
            {code)

            stevel@apache.org Steve Loughran added a comment - I'll let Chris do the final review. Now, some bad news about logging @ debug. For commons logging APIs, that needs to be wrapped by if (LOG.isDebugEnabled() clauses; this skips the expense of building strings which are then never used. For classes which use the SLF4J logging APIs, you can get away with uising LOG.debug() , provided the style is LOG.debug( "Adding {}" , fileStatusInfo.fileStatus); Here the string concat only happens if the log is @ debug level, so less expensive and no longer needs to be wrapped. That's why we get away with this in the s3a classes, which have all been upgraded. This leaves you with a choice: wrap the debug statements or move the LOG up to SLF4J, the latter simply by changing the class of the log and its factory, adding the new imports and deleting the old one Logger LOG = LoggerFactory.getLogger(...) {code)

            Thanks stevel@apache.org. Added isDebugEnabled() to be consistent with rest of the code in the latest patch.

            rajesh.balamohan Rajesh Balamohan added a comment - Thanks stevel@apache.org . Added isDebugEnabled() to be consistent with rest of the code in the latest patch.
            cnauroth Chris Nauroth added a comment -

            I think ArrayList would be a better choice for this algorithm than LinkedList. Here is a quote from the JavaDocs of Collections#shuffle.

            This method runs in linear time. If the specified list does not implement the RandomAccess interface and is large, this implementation dumps the specified list into an array before shuffling it, and dumps the shuffled array back into the list. This avoids the quadratic behavior that would result from shuffling a "sequential access" list in place.

            Since a linked list cannot offer constant-time indexing, the shuffle must copy it into a separate array to satisfy its own linear-time guarantee. With an array, we'd have constant-time indexing, so we'd avoid the extra memory footprint of copying the list.

            If its smaller list, there are higher chances of getting the same result. With more items (increased to 100 now), it might not be the case.

            I don't think we can commit a test with any non-zero chance of intermittent failure if the shuffle results in the same input list. We have a problem with managing intermittently failing tests in Hadoop, so I'd prefer not to add more.

            The only solution I can think of is to generate a random seed in the constructor, save it as a member variable, and then use it to build a Random to pass to Collections#shuffle(List, Random). The test could retrieve the seed by calling a VisibleForTesting method, make its own Random, and then make its own call to Collections#shuffle(List, Random). Since both product code and test code would use the same seed, the 2 shuffles would yield the same results.

            I realize that makes the code more awkward though. Let me know if you have a better idea.

            cnauroth Chris Nauroth added a comment - I think ArrayList would be a better choice for this algorithm than LinkedList . Here is a quote from the JavaDocs of Collections#shuffle . This method runs in linear time. If the specified list does not implement the RandomAccess interface and is large, this implementation dumps the specified list into an array before shuffling it, and dumps the shuffled array back into the list. This avoids the quadratic behavior that would result from shuffling a "sequential access" list in place. Since a linked list cannot offer constant-time indexing, the shuffle must copy it into a separate array to satisfy its own linear-time guarantee. With an array, we'd have constant-time indexing, so we'd avoid the extra memory footprint of copying the list. If its smaller list, there are higher chances of getting the same result. With more items (increased to 100 now), it might not be the case. I don't think we can commit a test with any non-zero chance of intermittent failure if the shuffle results in the same input list. We have a problem with managing intermittently failing tests in Hadoop, so I'd prefer not to add more. The only solution I can think of is to generate a random seed in the constructor, save it as a member variable, and then use it to build a Random to pass to Collections#shuffle(List, Random) . The test could retrieve the seed by calling a VisibleForTesting method, make its own Random , and then make its own call to Collections#shuffle(List, Random) . Since both product code and test code would use the same seed, the 2 shuffles would yield the same results. I realize that makes the code more awkward though. Let me know if you have a better idea.

            Thank you cnauroth.

            Changes in the latest patch:
            1. Changed LinkedList to ArrayList in SimpleCopyListing
            2. For the test, I thought of having guava's Ordering.arbitrary() which relies on System.identityHashCode, but that is also prone to collisions (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6809470). Instead, using setSeedForRandomListing(long seed) with "@VisibleForTesting" for testing purpose.

            rajesh.balamohan Rajesh Balamohan added a comment - Thank you cnauroth . Changes in the latest patch: 1. Changed LinkedList to ArrayList in SimpleCopyListing 2. For the test, I thought of having guava's Ordering.arbitrary() which relies on System.identityHashCode , but that is also prone to collisions ( http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6809470 ). Instead, using setSeedForRandomListing(long seed) with "@VisibleForTesting" for testing purpose.
            hadoopqa Hadoop QA added a comment -
            +1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 29s Docker mode activated.
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
            +1 mvninstall 6m 55s branch-2 passed
            +1 compile 0m 16s branch-2 passed with JDK v1.8.0_101
            +1 compile 0m 19s branch-2 passed with JDK v1.7.0_111
            +1 checkstyle 0m 16s branch-2 passed
            +1 mvnsite 0m 24s branch-2 passed
            +1 mvneclipse 0m 15s branch-2 passed
            +1 findbugs 0m 31s branch-2 passed
            +1 javadoc 0m 12s branch-2 passed with JDK v1.8.0_101
            +1 javadoc 0m 14s branch-2 passed with JDK v1.7.0_111
            +1 mvninstall 0m 19s the patch passed
            +1 compile 0m 14s the patch passed with JDK v1.8.0_101
            +1 javac 0m 14s the patch passed
            +1 compile 0m 16s the patch passed with JDK v1.7.0_111
            +1 javac 0m 16s the patch passed
            +1 checkstyle 0m 13s hadoop-tools/hadoop-distcp: The patch generated 0 new + 51 unchanged - 1 fixed = 51 total (was 52)
            +1 mvnsite 0m 22s the patch passed
            +1 mvneclipse 0m 12s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 findbugs 0m 40s the patch passed
            +1 javadoc 0m 10s the patch passed with JDK v1.8.0_101
            +1 javadoc 0m 12s the patch passed with JDK v1.7.0_111
            +1 unit 7m 35s hadoop-distcp in the patch passed with JDK v1.7.0_111.
            +1 asflicense 0m 18s The patch does not generate ASF License warnings.
            30m 25s



            Subsystem Report/Notes
            Docker Image:yetus/hadoop:b59b8b7
            JIRA Issue HADOOP-13169
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828788/HADOOP-13169-branch-2-009.patch
            Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
            uname Linux ef6fcd7a7e98 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
            git revision branch-2 / ee101e4
            Default Java 1.7.0_111
            Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111
            findbugs v3.0.0
            JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10526/testReport/
            modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp
            Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10526/console
            Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

            This message was automatically generated.

            hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 29s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 6m 55s branch-2 passed +1 compile 0m 16s branch-2 passed with JDK v1.8.0_101 +1 compile 0m 19s branch-2 passed with JDK v1.7.0_111 +1 checkstyle 0m 16s branch-2 passed +1 mvnsite 0m 24s branch-2 passed +1 mvneclipse 0m 15s branch-2 passed +1 findbugs 0m 31s branch-2 passed +1 javadoc 0m 12s branch-2 passed with JDK v1.8.0_101 +1 javadoc 0m 14s branch-2 passed with JDK v1.7.0_111 +1 mvninstall 0m 19s the patch passed +1 compile 0m 14s the patch passed with JDK v1.8.0_101 +1 javac 0m 14s the patch passed +1 compile 0m 16s the patch passed with JDK v1.7.0_111 +1 javac 0m 16s the patch passed +1 checkstyle 0m 13s hadoop-tools/hadoop-distcp: The patch generated 0 new + 51 unchanged - 1 fixed = 51 total (was 52) +1 mvnsite 0m 22s the patch passed +1 mvneclipse 0m 12s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 40s the patch passed +1 javadoc 0m 10s the patch passed with JDK v1.8.0_101 +1 javadoc 0m 12s the patch passed with JDK v1.7.0_111 +1 unit 7m 35s hadoop-distcp in the patch passed with JDK v1.7.0_111. +1 asflicense 0m 18s The patch does not generate ASF License warnings. 30m 25s Subsystem Report/Notes Docker Image:yetus/hadoop:b59b8b7 JIRA Issue HADOOP-13169 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828788/HADOOP-13169-branch-2-009.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux ef6fcd7a7e98 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision branch-2 / ee101e4 Default Java 1.7.0_111 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111 findbugs v3.0.0 JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10526/testReport/ modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10526/console Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
            stevel@apache.org Steve Loughran added a comment -

            *I like the use of a common random seed for the tests, stops us having something else to ignore in jenkins

            • validateFinalListing can use try-with-resources for its cleanup
            stevel@apache.org Steve Loughran added a comment - *I like the use of a common random seed for the tests, stops us having something else to ignore in jenkins validateFinalListing can use try-with-resources for its cleanup
            cnauroth Chris Nauroth added a comment -

            rajesh.balamohan, thanks for the update. I think this is really close.

            In addition to Steve's feedback, would you please switch one remaining instance of LinkedList to ArrayList?

            After that, I'll put this through a full test run against multiple file systems, and then I expect we can commit.

            cnauroth Chris Nauroth added a comment - rajesh.balamohan , thanks for the update. I think this is really close. In addition to Steve's feedback, would you please switch one remaining instance of LinkedList to ArrayList ? After that, I'll put this through a full test run against multiple file systems, and then I expect we can commit.

            Thank you stevel@apache.org, cnauroth. Attached the revised the patch to address the review comments.

            rajesh.balamohan Rajesh Balamohan added a comment - Thank you stevel@apache.org , cnauroth . Attached the revised the patch to address the review comments.
            hadoopqa Hadoop QA added a comment -
            +1 overall



            Vote Subsystem Runtime Comment
            0 reexec 0m 23s Docker mode activated.
            +1 @author 0m 0s The patch does not contain any @author tags.
            +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
            +1 mvninstall 6m 40s branch-2 passed
            +1 compile 0m 17s branch-2 passed with JDK v1.8.0_101
            +1 compile 0m 18s branch-2 passed with JDK v1.7.0_111
            +1 checkstyle 0m 16s branch-2 passed
            +1 mvnsite 0m 24s branch-2 passed
            +1 mvneclipse 0m 14s branch-2 passed
            +1 findbugs 0m 30s branch-2 passed
            +1 javadoc 0m 13s branch-2 passed with JDK v1.8.0_101
            +1 javadoc 0m 14s branch-2 passed with JDK v1.7.0_111
            +1 mvninstall 0m 18s the patch passed
            +1 compile 0m 13s the patch passed with JDK v1.8.0_101
            +1 javac 0m 13s the patch passed
            +1 compile 0m 17s the patch passed with JDK v1.7.0_111
            +1 javac 0m 17s the patch passed
            +1 checkstyle 0m 13s hadoop-tools/hadoop-distcp: The patch generated 0 new + 51 unchanged - 1 fixed = 51 total (was 52)
            +1 mvnsite 0m 22s the patch passed
            +1 mvneclipse 0m 12s the patch passed
            +1 whitespace 0m 0s The patch has no whitespace issues.
            +1 findbugs 0m 40s the patch passed
            +1 javadoc 0m 9s the patch passed with JDK v1.8.0_101
            +1 javadoc 0m 13s the patch passed with JDK v1.7.0_111
            +1 unit 7m 28s hadoop-distcp in the patch passed with JDK v1.7.0_111.
            +1 asflicense 0m 16s The patch does not generate ASF License warnings.
            30m 8s



            Subsystem Report/Notes
            Docker Image:yetus/hadoop:b59b8b7
            JIRA Issue HADOOP-13169
            JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828939/HADOOP-13169-branch-2-010.patch
            Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
            uname Linux f0a539de7873 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
            Build tool maven
            Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
            git revision branch-2 / b03a0be
            Default Java 1.7.0_111
            Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111
            findbugs v3.0.0
            JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10529/testReport/
            modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp
            Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10529/console
            Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

            This message was automatically generated.

            hadoopqa Hadoop QA added a comment - +1 overall Vote Subsystem Runtime Comment 0 reexec 0m 23s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 1 new or modified test files. +1 mvninstall 6m 40s branch-2 passed +1 compile 0m 17s branch-2 passed with JDK v1.8.0_101 +1 compile 0m 18s branch-2 passed with JDK v1.7.0_111 +1 checkstyle 0m 16s branch-2 passed +1 mvnsite 0m 24s branch-2 passed +1 mvneclipse 0m 14s branch-2 passed +1 findbugs 0m 30s branch-2 passed +1 javadoc 0m 13s branch-2 passed with JDK v1.8.0_101 +1 javadoc 0m 14s branch-2 passed with JDK v1.7.0_111 +1 mvninstall 0m 18s the patch passed +1 compile 0m 13s the patch passed with JDK v1.8.0_101 +1 javac 0m 13s the patch passed +1 compile 0m 17s the patch passed with JDK v1.7.0_111 +1 javac 0m 17s the patch passed +1 checkstyle 0m 13s hadoop-tools/hadoop-distcp: The patch generated 0 new + 51 unchanged - 1 fixed = 51 total (was 52) +1 mvnsite 0m 22s the patch passed +1 mvneclipse 0m 12s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 40s the patch passed +1 javadoc 0m 9s the patch passed with JDK v1.8.0_101 +1 javadoc 0m 13s the patch passed with JDK v1.7.0_111 +1 unit 7m 28s hadoop-distcp in the patch passed with JDK v1.7.0_111. +1 asflicense 0m 16s The patch does not generate ASF License warnings. 30m 8s Subsystem Report/Notes Docker Image:yetus/hadoop:b59b8b7 JIRA Issue HADOOP-13169 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828939/HADOOP-13169-branch-2-010.patch Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux f0a539de7873 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision branch-2 / b03a0be Default Java 1.7.0_111 Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111 findbugs v3.0.0 JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/10529/testReport/ modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/10529/console Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org This message was automatically generated.
            cnauroth Chris Nauroth added a comment - - edited

            +1 for patch 010. I verified that TestAzureNativeContractDistCp and ITestS3AContractDistCp are passing. Rajesh, thank you for the patch. Steve, thank you for the code review.

            I have committed this to trunk, branch-2 and branch-2.8.

            cnauroth Chris Nauroth added a comment - - edited +1 for patch 010. I verified that TestAzureNativeContractDistCp and ITestS3AContractDistCp are passing. Rajesh, thank you for the patch. Steve, thank you for the code review. I have committed this to trunk, branch-2 and branch-2.8.
            hudson Hudson added a comment -

            SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10462 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10462/)
            HADOOP-13169. Randomize file list in SimpleCopyListing. Contributed by (cnauroth: rev 98bdb5139769eb55893971b43b9c23da9513a784)

            • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java
            • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
            • (edit) hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestCopyListing.java
            hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10462 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10462/ ) HADOOP-13169 . Randomize file list in SimpleCopyListing. Contributed by (cnauroth: rev 98bdb5139769eb55893971b43b9c23da9513a784) (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpConstants.java (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java (edit) hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestCopyListing.java

            People

              rajesh.balamohan Rajesh Balamohan
              rajesh.balamohan Rajesh Balamohan
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: