Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.21.0
    • Fix Version/s: 2.9.0, 3.0.0-alpha4
    • Component/s: tools/distcp
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      If a positive value is passed to the command line switch -blocksperchunk, files with more blocks than this value will be split into chunks of `<blocksperchunk>` blocks each, to be transferred in parallel and reassembled on the destination. By default, `<blocksperchunk>` is 0 and files are transmitted in their entirety without splitting. This switch is only applicable when the source file system supports getBlockLocations and the target supports concat.
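
      For example (hypothetical paths), the following splits any file larger than 128 blocks into 128-block chunks:

        hadoop distcp -blocksperchunk 128 hdfs://nn1:8020/src/largeFile hdfs://nn2:8020/dst/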

      Description

      The minimum unit of work for a distcp task is a file. We have files that are greater than 1 TB with a block size of 1 GB. If we use distcp to copy these files, the tasks either take a very long time or eventually fail. A better approach would be for distcp to copy all the source blocks in parallel, and then stitch the blocks back into files at the destination via the HDFS Concat API (HDFS-222).
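
      As context, a minimal sketch of the stitching step, assuming the chunks have already been copied to the destination and the default file system is HDFS (paths and chunk names are hypothetical):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.hdfs.DistributedFileSystem;

        public class ConcatSketch {
          public static void main(String[] args) throws Exception {
            // concat() is HDFS-specific, hence the cast to DistributedFileSystem.
            DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());
            // Assume the first chunk was written (or renamed) to the final target path...
            Path target = new Path("/dst/largefile");
            // ...and the remaining chunks are appended to it, in order.
            Path[] chunks = { new Path("/dst/largefile.chunk.1"),
                              new Path("/dst/largefile.chunk.2") };
            dfs.concat(target, chunks);
          }
        }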

      1. HADOOP-11794.001.patch
        52 kB
        Yongjun Zhang
      2. HADOOP-11794.002.patch
        58 kB
        Yongjun Zhang
      3. HADOOP-11794.003.patch
        61 kB
        Yongjun Zhang
      4. HADOOP-11794.004.patch
        62 kB
        Yongjun Zhang
      5. HADOOP-11794.005.patch
        62 kB
        Yongjun Zhang
      6. HADOOP-11794.006.patch
        63 kB
        Yongjun Zhang
      7. HADOOP-11794.007.patch
        70 kB
        Yongjun Zhang
      8. HADOOP-11794.008.patch
        70 kB
        Yongjun Zhang
      9. HADOOP-11794.009.patch
        70 kB
        Yongjun Zhang
      10. HADOOP-11794.010.branch2.002.patch
        71 kB
        Yongjun Zhang
      11. HADOOP-11794.010.branch2.patch
        70 kB
        Yongjun Zhang
      12. HADOOP-11794.010.patch
        70 kB
        Yongjun Zhang
      13. MAPREDUCE-2257.patch
        62 kB
        Rosie Li

        Issue Links

          Activity

          aw Allen Wittenauer added a comment -

          Won't changing the unit break hftp?

          dhruba dhruba borthakur added a comment -

          A new option to distcp could trigger parallel-block copy. It cannot be used with hftp.

          gvenugo gopikannan venugopalsamy added a comment -

          Hello,
          I wish to contribute to this issue, but I am new to this project. Can you give me some tips on where to start?

          nikhilp nikhil added a comment -

          Is anyone working on this feature?

          gvenugo gopikannan venugopalsamy added a comment -

          I want to work on this. Hey nikhil, would you like to discuss it?

          rosieli Rosie Li added a comment -

          I'm working on this feature right now. I've finished writing the code and am testing it now.

          rosieli Rosie Li added a comment -

          By default, distcp.copy.by.chunk is set to true in the configuration. The user can set it to false to use the original distcp. But the type of destination will be checked afterward. distcp.copy.by.chunk will remain true only if the destination file system is the distributed file system.
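
          For illustration, assuming the configuration key from this patch (using org.apache.hadoop.conf.Configuration), opting out in code would look like:

            Configuration conf = new Configuration();
            // Fall back to the original whole-file distcp behavior.
            conf.setBoolean("distcp.copy.by.chunk", false);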

          rosieli Rosie Li added a comment -

          Chop files into chunks before the copy, and stitch them back after the copy.

          rosieli Rosie Li added a comment -

          Chop files into chunks before the copy, and then stitch them back after the copy.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12474807/MAPREDUCE-2257.patch
          against trunk revision 1082703.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The patch appears to cause tar ant target to fail.

          -1 findbugs. The patch appears to cause Findbugs (version 1.3.9) to fail.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:

          -1 contrib tests. The patch failed contrib unit tests.

          -1 system test framework. The patch failed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/148//testReport/
          Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/148//console

          This message is automatically generated.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12474806/MAPREDUCE-2257.patch
          against trunk revision 1082703.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 2256 javac compiler warnings (more than the trunk's current 2244 warnings).

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/147//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/147//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/147//console

          This message is automatically generated.

          rosieli Rosie Li added a comment -

          Imported the missing package.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12475059/MAPREDUCE-2257.patch
          against trunk revision 1087098.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 2256 javac compiler warnings (more than the trunk's current 2244 warnings).

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/150//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/150//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/150//console

          This message is automatically generated.

          rosieli Rosie Li added a comment -

          Need to fix the bug in the concat method before using the parallel distcp.

          rosieli Rosie Li added a comment -

          Fixed the Findbugs warning.

          aw Allen Wittenauer added a comment -

          > By default, distcp.copy.by.chunk is set to true in the configuration. The user can set it to false to use the original distcp. But the type of destination will be checked afterward. distcp.copy.by.chunk will remain true only if the destination file system is the distributed file system.

          This needs to get added to the release notes.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12475126/MAPREDUCE-2257.patch
          against trunk revision 1087098.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 2256 javac compiler warnings (more than the trunk's current 2244 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/152//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/152//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/152//console

          This message is automatically generated.

          rosieli Rosie Li added a comment -

          The failure of the contrib test is not related to the new distcp.

          rschmidt Rodrigo Schmidt added a comment -

          The class FileChunkPair is not really a pair, right? It stores 5 fields.

          Can't we somehow unify the if/else in copy()? At least doCopyFile() could use doCopyFileChunks().

          rosieli Rosie Li added a comment -

          FileChunkPair still holds src/dst file pairs, but the other three fields give the starting point and offset of the file chunk pairs.
          Also, I merged doCopyFile() and doCopyFileChunks(); now we only have one doCopyFile() method.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12476634/MAPREDUCE-2257.patch
          against trunk revision 1094093.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 4 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          -1 javac. The applied patch generated 2256 javac compiler warnings (more than the trunk's current 2244 warnings).

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/173//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/173//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-MAPREDUCE-Build/173//console

          This message is automatically generated.

          rosieli Rosie Li added a comment -

          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:59: warning: [deprecation] org.apache.hadoop.mapred.FileSplit in org.apache.hadoop.mapred has been deprecated
          [javac] import org.apache.hadoop.mapred.FileSplit;
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:60: warning: [deprecation] org.apache.hadoop.mapred.InputFormat in org.apache.hadoop.mapred has been deprecated
          [javac] import org.apache.hadoop.mapred.InputFormat;
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:61: warning: [deprecation] org.apache.hadoop.mapred.InputSplit in org.apache.hadoop.mapred has been deprecated
          [javac] import org.apache.hadoop.mapred.InputSplit;
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:63: warning: [deprecation] org.apache.hadoop.mapred.JobClient in org.apache.hadoop.mapred has been deprecated
          [javac] import org.apache.hadoop.mapred.JobClient;
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:64: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
          [javac] import org.apache.hadoop.mapred.JobConf;
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:66: warning: [deprecation] org.apache.hadoop.mapred.Mapper in org.apache.hadoop.mapred has been deprecated
          [javac] import org.apache.hadoop.mapred.Mapper;
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:211: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
          [javac] private JobConf conf;
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:738: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
          [javac] private static void checkSrcPath(JobConf jobConf, List<Path> srcPaths)
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:831: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
          [javac] static private void finalize(Configuration conf, JobConf jobconf,
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:1096: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
          [javac] private static int setMapCount(long totalBytes, JobConf job)
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:1120: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
          [javac] private static JobConf createJobConf(Configuration conf) {
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:1148: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
          [javac] private static void setReplication(Configuration conf, JobConf jobConf,
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:1190: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
          [javac] static boolean setup(Configuration conf, JobConf jobConf,
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:1562: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
          [javac] FileSystem jobfs, Path jobdir, JobConf jobconf, Configuration conf
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:257: warning: [deprecation] org.apache.hadoop.mapred.InputFormat in org.apache.hadoop.mapred has been deprecated
          [javac] static class CopyInputFormat implements InputFormat<Text, Text> {
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:265: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
          [javac] public InputSplit[] getSplits(JobConf job, int numSplits)
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:265: warning: [deprecation] org.apache.hadoop.mapred.InputSplit in org.apache.hadoop.mapred has been deprecated
          [javac] public InputSplit[] getSplits(JobConf job, int numSplits)
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:316: warning: [deprecation] org.apache.hadoop.mapred.InputSplit in org.apache.hadoop.mapred has been deprecated
          [javac] public RecordReader<Text, Text> getRecordReader(InputSplit split,
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:317: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
          [javac] JobConf job, Reporter reporter) throws IOException {
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:326: warning: [deprecation] org.apache.hadoop.mapred.Mapper in org.apache.hadoop.mapred has been deprecated
          [javac] implements Mapper<LongWritable, FilePair, WritableComparable<?>, Text> {
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:337: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
          [javac] private JobConf job;
          [javac] ^
          [javac] /data/users/rosieli/hadoop_jira/hadoop-mapred-trunk/src/tools/org/apache/hadoop/tools/DistCp.java:617: warning: [deprecation] org.apache.hadoop.mapred.JobConf in org.apache.hadoop.mapred has been deprecated
          [javac] public void configure(JobConf job)
          For the warnings added: they all come from using deprecated classes.

          rschmidt Rodrigo Schmidt added a comment -

          Shouldn't you change your code to use the class that replaced the deprecated one?

          rosieli Rosie Li added a comment -

          The original code was already using the deprecated ones, like JobConf and InputSplit.

          rschmidt Rodrigo Schmidt added a comment -

          Maybe it's time to change it to non-deprecated classes.

          rosieli Rosie Li added a comment -

          Made changes to the methods that used deprecated classes.

          rschmidt Rodrigo Schmidt added a comment -

          +1
          Patch looks good. Just make sure it passes the QA test. Hadoop QA doesn't seem to have picked up the latest version.

          revans2 Robert Joseph Evans added a comment -

          Canceling the patch as it is rather old, and does not apply to trunk any longer. Dhruba, this patch looks like it has a lot of promise to speed things up during a distcp of large files. If you no longer want to work on this patch please indicate it so that someone else can pick it up. If you do want to work on it I would be happy to review it and commit it after your upmerge.

          dhruba dhruba borthakur added a comment -

          This patch really sped up distcp. However, I am unable to work on this at present. If somebody can take over, that would be great; otherwise I will get back to this one sometime soon.

          mithun Mithun Radhakrishnan added a comment -

          I'll take a look. I already have a patch that accomplishes the bulk of this. The finishing touches remain.

          I'll post a patch shortly.

          mahadev Mahadev konar added a comment -

          Thanks for taking this up Mithun!

          qwertymaniac Harsh J added a comment -

          Mithun Radhakrishnan - I know it's been a while, but are you still working on this?

          Since HDFS-222 is getting some attention, I feel it would be good to have this built in as a use of it (and Dhruba has already mentioned it is a great improvement to DistCp).

          mithun Mithun Radhakrishnan added a comment -

          Sorry, I haven't been able to spare the time yet. I'll try to make the time shortly.

          mtxrym liuwei added a comment -

          Since distcp now has distcp2: does a patch exist for distcp2 to copy blocks in parallel?

          yzhangal Yongjun Zhang added a comment -

          Hi Mithun Radhakrishnan, thanks for your earlier work here. I wonder whether you will continue to work on this issue? If not, I'm interested in taking it on. Thanks.

          mithun Mithun Radhakrishnan added a comment -

          Yongjun Zhang: Thank you, sir. Please do. Hive has kept me busy enough not to devote time here. I'd be happy to review your work.

          I had a patch a couple of years ago which split files on block-boundaries, copied them over, and then stitched them together using DistributedFileSystem.concat() in a reduce-step. If I can find the patch, I'll ping it to you, but it's not terribly hard to do this from scratch. The prototype had very promising performance.

          I look forward to your solution.

          yzhangal Yongjun Zhang added a comment -

          Thanks Mithun Radhakrishnan. There is a patch currently attached to this jira, is that the one you referred to in your last comment?

          mithun Mithun Radhakrishnan added a comment -

          Sorry, no. That's likely dhruba borthakur's work, which might have been based on the DistCp-v1 code. We'll need new code for the DistCp-v2 code (i.e. my rewrite from MAPREDUCE-2765).

          Apologies if you've already thought this through. One would need to change the DynamicInputFormat#createSplits() implementation, which currently looks thus:

            private List<InputSplit> createSplits(JobContext jobContext,
                                                  List<DynamicInputChunk> chunks)
                    throws IOException {
              int numMaps = getNumMapTasks(jobContext.getConfiguration());
          
              final int nSplits = Math.min(numMaps, chunks.size());
              List<InputSplit> splits = new ArrayList<InputSplit>(nSplits);
              
              for (int i=0; i< nSplits; ++i) {
                TaskID taskId = new TaskID(jobContext.getJobID(), TaskType.MAP, i);
                chunks.get(i).assignTo(taskId);
                splits.add(new FileSplit(chunks.get(i).getPath(), 0,
                    // Setting non-zero length for FileSplit size, to avoid a possible
                    // future when 0-sized file-splits are considered "empty" and skipped
                    // over.
                    getMinRecordsPerChunk(jobContext.getConfiguration()),
                    null));
              }
              DistCpUtils.publish(jobContext.getConfiguration(),
                                  CONF_LABEL_NUM_SPLITS, splits.size());
              return splits;
            }
          

          You'll need to create a FileSplit per file-block (by first examining the file's block-size). The mappers will now need to emit something like (relativePathForOriginalSourceFile, targetLocation_with_block_number). By keying on the relative-source-paths (+ expected number of blocks), you can get all the target-block-locations to hit the same reducer, where you can stitch them together.
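
          A rough sketch of the per-block split creation (illustrative only; the method name is an assumption, and the types are org.apache.hadoop.fs.BlockLocation and org.apache.hadoop.mapreduce.lib.input.FileSplit):

            // One FileSplit per block of the source file; each mapper then copies
            // only the byte range [offset, offset + length) of that file.
            private List<InputSplit> createPerBlockSplits(FileSystem fs, Path source)
                throws IOException {
              FileStatus stat = fs.getFileStatus(source);
              BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
              List<InputSplit> splits = new ArrayList<InputSplit>(blocks.length);
              for (BlockLocation block : blocks) {
                splits.add(new FileSplit(source, block.getOffset(), block.getLength(),
                    block.getHosts()));
              }
              return splits;
            }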

          Good luck. :]

          dhruba dhruba borthakur added a comment -

          Thanks for offering to pick this up Yongjun Zhang!

          yzhangal Yongjun Zhang added a comment -

          Thanks Mithun Radhakrishnan and dhruba borthakur!

          There will be some complexity with regard to block size, since we now support variable-size blocks (introduced by the append feature). We might need to ask the NN for the sizes of all the blocks a file has, and avoid having a split boundary fall in the middle of a block. Another possibility is to split the block into two if that happens (since we now support blocks of multiple sizes); I have not looked deeper at this yet.

          And I'm thinking we could have one FileSplit span multiple file blocks; we could make that an input option to distcp.

          Thanks again.

          yzhangal Yongjun Zhang added a comment -

          BTW guys, I was thinking (some ideas are inspired by the patch attached to this jira):

          1. When creating the file listing, which used to have one entry like <srcFile, CopyListingFileStatus> per source file, we can create multiple entries for the same file, each representing one chunk of the file, and include <offset, chunkLength> as two new members of class CopyListingFileStatus (see the sketch after this list).

          2. We can make some changes to the following code in UniformSizeInputFormat.java (probably enabled by a command line switch). In the code below, we currently check whether including a file would exceed bytesPerSplit, and use that to decide when to start a new split. Instead of checking the file length, we would check the chunk length: if including the chunk would exceed bytesPerSplit, we don't include the chunk in the current split; otherwise, we include it.

          We could probably introduce a new ChunkUniformSizeInputFormat class instead of modifying the current one.

            private List<InputSplit> getSplits(Configuration configuration, int numSplits,
                                               long totalSizeBytes) throws IOException {
              List<InputSplit> splits = new ArrayList<InputSplit>(numSplits);
              long nBytesPerSplit = (long) Math.ceil(totalSizeBytes * 1.0 / numSplits);
          
              CopyListingFileStatus srcFileStatus = new CopyListingFileStatus();
              Text srcRelPath = new Text();
              long currentSplitSize = 0;
              long lastSplitStart = 0;
              long lastPosition = 0;
          
              final Path listingFilePath = getListingFilePath(configuration);
          
              if (LOG.isDebugEnabled()) {
                LOG.debug("Average bytes per map: " + nBytesPerSplit +
                    ", Number of maps: " + numSplits + ", total size: " + totalSizeBytes);
              }
              SequenceFile.Reader reader=null;
              try {
                reader = getListingFileReader(configuration);
                while (reader.next(srcRelPath, srcFileStatus)) {
                  // If adding the current file would cause the bytes per map to exceed
                  // limit. Add the current file to new split
                  if (currentSplitSize + srcFileStatus.getLen() > nBytesPerSplit && lastPosition != 0) {
                    FileSplit split = new FileSplit(listingFilePath, lastSplitStart,
                        lastPosition - lastSplitStart, null);
                    if (LOG.isDebugEnabled()) {
                      LOG.debug ("Creating split : " + split + ", bytes in split: " + currentSplitSize);
                    }
                    splits.add(split);
                    lastSplitStart = lastPosition;
                    currentSplitSize = 0;
                  }
                  currentSplitSize += srcFileStatus.getLen();
                  lastPosition = reader.getPosition();
                }
                if (lastPosition > lastSplitStart) {
                  FileSplit split = new FileSplit(listingFilePath, lastSplitStart,
                      lastPosition - lastSplitStart, null);
                  if (LOG.isDebugEnabled()) {
                    LOG.info ("Creating split : " + split + ", bytes in split: " + currentSplitSize);
                  }
                  splits.add(split);
                }
          
              } finally {
                IOUtils.closeStream(reader);
              }
          
              return splits;
            }
          

          3. The CopyMapper is changed to copy each chunk of the big file, with the chunk offset included in the temporary target file name.

          4. Change CopyCommitter to load the copy-listing file, iterate through it, and stitch the segments of the same file into the target file.
          This work is not quite distributed; a more elegant solution would be to use reducers to do the stitching, as Mithun suggested.
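
          As a sketch of item 1 above (the two new members; names are illustrative and may not match an eventual patch):

            import org.apache.hadoop.fs.FileStatus;

            // Sketch only: a listing entry that covers one chunk of a source file.
            public class CopyListingFileStatus extends FileStatus {
              private long chunkOffset;  // byte offset of this chunk in the source file
              private long chunkLength;  // number of bytes in this chunk

              public long getChunkOffset() { return chunkOffset; }
              public long getChunkLength() { return chunkLength; }

              public void setChunk(long offset, long length) {
                this.chunkOffset = offset;
                this.chunkLength = length;
              }
            }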

          Wonder if this makes sense to you guys.

          Thanks.

          yzhangal Yongjun Zhang added a comment -

          We probably don't need ChunkUniformSizeInputFormat and can just use UniformSizeInputFormat (when we break large files into chunks, the splits become more uniform). When a file doesn't need to be broken into chunks, there is a single entry for it in the fileListing, and we make the entry's chunkLength the same as its file length.

          I was thinking that for the initial implementation, we can just change the CopyCommitter, as I described in the last comment, instead of introducing a reducer stage for distcp.

          Welcome to comment. Thanks.

          yzhangal Yongjun Zhang added a comment -

          Looking at it more, I think we should apply "breaking files into chunks" to both UniformSizeInputFormat and DynamicInputFormat as an improvement to both strategies, and enable it via a command line option initially. Thanks.

          yzhangal Yongjun Zhang added a comment -

          Hi Mithun Radhakrishnan,

          Some more thinking to share.

          When I commented earlier about "include <offset, chunkLength> as two new members of class CopyListingFileStatus", I was thinking of the offset and chunkLength at the byte level. Inspired by your suggestion that "You'll need to create a FileSplit per file-block", I think we can express them in blocks instead.

That is, we can split the file into chunks, where each chunk contains multiple blocks. A chunk is represented as a block range <bgnIdx, numBlocks>, where bgnIdx is the block index of the first block of the chunk, and numBlocks is the number of blocks in the chunk. A degenerate case is what you suggested: one file-block per split. But I'm making it more flexible here, so that we can support a variable number of blocks per split.

I'd make the number of blocks per split a distcp parameter. For a given distcp run, the number of blocks in a split is fixed as specified by the parameter, except for the last split of a file, which might contain fewer blocks (see the sketch below). BTW, because of the "append" feature, a single file may contain blocks of different sizes, so it's not always true that each split will be the same size in bytes.
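
To make the chunking concrete, here is a minimal sketch (the helper name and types are illustrative, not from any patch) of grouping a file's blocks into <bgnIdx, numBlocks> ranges with a fixed blocks-per-split parameter:

  import java.util.ArrayList;
  import java.util.List;

  // Illustrative sketch: group totalBlocks blocks into (bgnIdx, numBlocks) ranges
  // of at most blocksPerSplit blocks; only the last range may be smaller.
  static List<int[]> toBlockRanges(int totalBlocks, int blocksPerSplit) {
    List<int[]> ranges = new ArrayList<>();
    for (int bgnIdx = 0; bgnIdx < totalBlocks; bgnIdx += blocksPerSplit) {
      ranges.add(new int[] {bgnIdx, Math.min(blocksPerSplit, totalBlocks - bgnIdx)});
    }
    return ranges;
  }

For example, 104 blocks with 10 blocks per split yields (0, 10), (10, 10), ..., (100, 4).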

We need a new client-namenode API to get back the locatedBlocks for a specified block range, so the CopyMapper can work on the given block range (other applications may need a similar API as well). I will create a jira about it.

BTW, I had quite some fun with distcp, but I did not know who the author of distcp v2 was until working on this jira. Appreciate your excellent work!

          Thanks.

          mithun Mithun Radhakrishnan added a comment -

          Yongjun Zhang,

          Appreciate your excellent work!

          You're too kind. :]

          But I'm making it more flexible here, such that we can support variable number blocks per split.

          I agree with the principle of what you're suggesting. Combining multiple splits into a larger split (based on size) is a problem that CombineFileInputFormat provides a solution for. Do you think we can use CombineFileInputFormat to combine block-level splits into a larger split?

          We need some new client-namenode API protocol to get back the locatedBlocks for the specified block range...

          Hmm... Do we? DistCp copies whole files (even if at a split level). Since we can retrieve located blocks for all blocks in the file, shouldn't that be enough? We could group locatedBlocks by block-id. Perhaps I'm missing something.

          yzhangal Yongjun Zhang added a comment -

          Thanks Mithun Radhakrishnan!

Not sure about CombineFileInputFormat, but I will take a look.

          Hmm... Do we? DistCp copies whole files (even if at a split level). Since we can retrieve located blocks for all blocks in the file, shouldn't that be enough? We could group locatedBlocks by block-id. Perhaps I'm missing something.

Sorry I was not clear. This jira is to avoid copying a single large file within one mapper. What I have in mind is to break a large file into block ranges (controlled by a new distcp command line arg), such as (0, 10), (10, 10), ... (100, 4), where each entry is a pair of (starting block index, number of blocks), and all entries for the same file except the last have the same number of blocks. We could then assign the entries of the same file to different mappers (to work in parallel). In order to do this, we can use the API I described to fetch back the block locations for a block range. My argument is that fetching all block locations for a file is not as efficient as fetching only the block range the mapper is assigned to work on.

Do you agree, based on my explanation here, that the API would help? I have done a prototype of the API to fetch block locations for a block range, and will try to post it after the holiday. I think there may be other applications that need this kind of API too.
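
Purely as an illustration of what such an RPC might look like (this signature is hypothetical, not the actual prototype mentioned above):

  // Hypothetical signature, for illustration only: fetch located blocks for
  // numBlocks blocks of file src, starting at block index bgnIdx.
  LocatedBlocks getBlockLocations(String src, long bgnIdx, int numBlocks)
      throws IOException;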

          Thanks.

          mithun Mithun Radhakrishnan added a comment -

          My argument is that fetching all block locations for a file is not as efficient as fetching only the block range the mapper is assigned to work on.

          Thank you for explaining. Let me see if I can phrase my questions more clearly than before:

          1. Would it make sense to include the block-locations within the splits, at the time of split-calculation, instead of the block-ranges? If yes, then we can make do with the API we already have, by fetching locatedBlocks for all files, and grouping them among the DistCp splits. (It is indeed possible that keeping ranges, and using your proposed API on the map-side might be faster. But those map-side calls might possibly also exert more parallel load on the name-node, depending on the number of maps.)
2. Naive question: Why do we need to identify locatedBlocks? Don't HDFS files have uniformly sized blocks (within a file)? As such, aren't the block-boundaries implicit (i.e. from blockId*blockSize to (blockId+1)*blockSize - 1)? Can't we simply copy that range of bytes into a new file (and stitch the new files in reduce)?
          yzhangal Yongjun Zhang added a comment -

Hi Mithun Radhakrishnan,

          For your question 2, it's because we now support variable block length, see
          https://issues.apache.org/jira/browse/HDFS-3689?focusedCommentId=14277548&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14277548

For question 1, I'm worried that it would take much longer to prepare the copy listing if we need to get the block locations for each file. That's why I think it's easier to defer this to the mapper. As you pointed out, this will indeed incur more communication between the mappers and the NN, but since the work is split among mappers, and the calls to the NN would be scattered across the lifespan of the distcp job, it should be easier than doing it when we prepare the copy listing. Plus, we may make the mapper cache some block locations if multiple block-ranges of the same file are assigned to the same mapper.

          Thanks.

          mithun Mithun Radhakrishnan added a comment -

          Ah, I finally see. That makes complete sense. Thank you for the pointer to the JIRA.

          Also, CombineFileInputFormat might work with UniformSizeInputFormat, but it might not with DynamicInputFormat. Maybe combining a configurable number of blocks (ranges) into splits would be easier to work with.

          I see what you're doing, and I agree.

          yzhangal Yongjun Zhang added a comment -

Thanks much Mithun Radhakrishnan, I will resume working on this after the holiday; wish you a nice one!

          mmukhi_2 Mahak Mukhi added a comment -

          Yongjun Zhang

Hey, I was wondering if you're still on this issue?

          yzhangal Yongjun Zhang added a comment -

Yes Mahak Mukhi. Sorry for the delay, which was due to a critical issue. I will update here soon.

          Thanks.

          mmukhi_2 Mahak Mukhi added a comment -

          No worries, just wanted to check in.

          yzhangal Yongjun Zhang added a comment - - edited

          Sorry for the long delay, attaching patch rev 001.

With this patch, we can pass -chunksize <x> to distcp to tell it to split large files into chunks, each containing the number of blocks specified by this new parameter, except that the last chunk of a file may be smaller. CopyMapper will treat each chunk as a single file so the chunks can be copied in parallel, and CopyCommitter will concat the parts into one target file.

With this switch, we will enable preserving block size, disable the randomization of entries in the sequence file, and disable the append feature. We could do further optimizations as follow-ups.
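
For readers less familiar with the stitch step, here is a simplified sketch of the idea, assuming the chunks of a target file were written as part files sorted by offset (names here are illustrative; the actual patch code differs). It relies on DistributedFileSystem.concat, which requires all parts to be on the same HDFS:

  import java.io.IOException;
  import java.util.Arrays;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  // Illustrative sketch: stitch chunk files (sorted by offset) into the target file.
  static void stitchChunks(DistributedFileSystem dfs, Path target,
      Path[] sortedChunks) throws IOException {
    // Move the first chunk into place, then concat the remaining chunks onto it.
    dfs.rename(sortedChunks[0], target);
    if (sortedChunks.length > 1) {
      dfs.concat(target, Arrays.copyOfRange(sortedChunks, 1, sortedChunks.length));
    }
  }

A run would then look like hadoop distcp -chunksize 10 <src> <dst>, splitting each large file into 10-block chunks.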

          Any review is very welcome!

          Thanks a lot.

In addition, thanks Wei-Chiu Chuang, Xiao Chen for assisting with an initial draft we did a while back; the three of us will be contributors on this jira.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 12s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 12m 37s trunk passed
          +1 compile 0m 17s trunk passed
          +1 checkstyle 0m 15s trunk passed
          +1 mvnsite 0m 18s trunk passed
          +1 mvneclipse 0m 19s trunk passed
          +1 findbugs 0m 25s trunk passed
          +1 javadoc 0m 13s trunk passed
          +1 mvninstall 0m 16s the patch passed
          +1 compile 0m 15s the patch passed
          +1 javac 0m 15s the patch passed
          -0 checkstyle 0m 13s hadoop-tools/hadoop-distcp: The patch generated 33 new + 230 unchanged - 11 fixed = 263 total (was 241)
          +1 mvnsite 0m 16s the patch passed
          +1 mvneclipse 0m 10s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 0m 29s the patch passed
          -1 javadoc 0m 9s hadoop-tools_hadoop-distcp generated 3 new + 49 unchanged - 0 fixed = 52 total (was 49)
          -1 unit 10m 47s hadoop-distcp in the patch failed.
          +1 asflicense 0m 17s The patch does not generate ASF License warnings.
          28m 46s



          Reason Tests
          Failed junit tests hadoop.tools.mapred.TestCopyCommitter
            hadoop.tools.TestOptionsParser
            hadoop.tools.TestDistCpSystem



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-11794
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12848419/HADOOP-11794.001.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux bc131b55cd5f 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 5d8b80e
          Default Java 1.8.0_111
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11473/artifact/patchprocess/diff-checkstyle-hadoop-tools_hadoop-distcp.txt
          javadoc https://builds.apache.org/job/PreCommit-HADOOP-Build/11473/artifact/patchprocess/diff-javadoc-javadoc-hadoop-tools_hadoop-distcp.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11473/artifact/patchprocess/patch-unit-hadoop-tools_hadoop-distcp.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11473/testReport/
          modules C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11473/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          yzhangal Yongjun Zhang added a comment -

          Patch rev 002 to fix test failures.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 22s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
          0 mvndep 0m 15s Maven dependency ordering for branch
          +1 mvninstall 13m 6s trunk passed
          +1 compile 14m 24s trunk passed
          +1 checkstyle 1m 47s trunk passed
          +1 mvnsite 1m 36s trunk passed
          +1 mvneclipse 0m 41s trunk passed
          +1 findbugs 2m 42s trunk passed
          +1 javadoc 1m 17s trunk passed
          0 mvndep 0m 18s Maven dependency ordering for patch
          +1 mvninstall 1m 17s the patch passed
          +1 compile 11m 38s the patch passed
          +1 javac 11m 38s the patch passed
          -0 checkstyle 1m 51s root: The patch generated 26 new + 374 unchanged - 11 fixed = 400 total (was 385)
          +1 mvnsite 1m 38s the patch passed
          +1 mvneclipse 0m 49s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 3m 7s the patch passed
          -1 javadoc 0m 24s hadoop-tools_hadoop-distcp generated 3 new + 49 unchanged - 0 fixed = 52 total (was 49)
          -1 unit 71m 4s hadoop-hdfs in the patch failed.
          +1 unit 11m 46s hadoop-distcp in the patch passed.
          +1 asflicense 0m 37s The patch does not generate ASF License warnings.
          166m 53s



          Reason Tests
          Timed out junit tests org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-11794
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12848700/HADOOP-11794.002.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux f93789aa4d43 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / ccf2d66
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11489/artifact/patchprocess/diff-checkstyle-root.txt
          javadoc https://builds.apache.org/job/PreCommit-HADOOP-Build/11489/artifact/patchprocess/diff-javadoc-javadoc-hadoop-tools_hadoop-distcp.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11489/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11489/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-distcp U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11489/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          yzhangal Yongjun Zhang added a comment -

Hi Mithun Radhakrishnan,

Thanks much for the earlier discussion. Would you please help review the patch when convenient?

          Thanks and best regards.

          atm Aaron T. Myers added a comment -

Latest patch looks pretty good to me. Just a few small comments:

          1. "randomdize" -> "randomize": // When splitLargeFile is enabled, we don't randomdize the copylist
2. In two places you have basically "if (LOG.isDebugEnabled()) { LOG.warn(...); }". You should use LOG.debug(...) in these places, and perhaps also make these debug messages a little more helpful instead of just "add1", which would require someone to read the source code to understand.

          3. I think this log message is a little misleading:
            +  CHUNK_SIZE("",
            +      new Option("chunksize", true, "Size of chunk in number of blocks when " +
            +          "splitting large files into chunks to copy in parallel")),
            

            Assuming I'm reading the code correctly, the way a file is determined to be "large" in this context is just if it has more blocks than the configured chunk size. This log message also seems to imply that there might be some other configuration option to enable/disable splitting large files at all. I think better text would be something like "If set to a positive value, files with more blocks than this value will be split at their block boundaries during transfer, and reassembled on the destination cluster. By default, files will be transmitted in their entirety without splitting."

          4. Rather than suppressing the checkstyle warnings, recommend implementing the builder pattern for the CopyListingFileStatus constructors. That should make things quite a bit clearer.
          5. There are a handful of lines that are changed that I think are just whitespace, but not a big deal.
          yzhangal Yongjun Zhang added a comment -

Hi Aaron T. Myers,

          Thanks a lot for the review! All very good comments!

I uploaded rev 003 to address all of them. In addition, I also made a couple of new changes:
1. make sure the target file system is a DistributedFileSystem; otherwise, ignore the -chunksize switch with a warning message (a minimal sketch of this check follows below)
          2. added documentation
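
A minimal sketch of the check in item 1, with illustrative names (targetPath, conf, blocksPerChunk); this is not the actual patch code:

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.hdfs.DistributedFileSystem;

  // Illustrative sketch: fall back to whole-file copies when the target is not HDFS.
  FileSystem targetFS = targetPath.getFileSystem(conf);
  if (blocksPerChunk > 0 && !(targetFS instanceof DistributedFileSystem)) {
    LOG.warn("-chunksize ignored: target file system does not support concat");
    blocksPerChunk = 0;
  }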

Would you please take a look again?

          Thanks.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 14s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
          0 mvndep 0m 14s Maven dependency ordering for branch
          +1 mvninstall 12m 44s trunk passed
          +1 compile 13m 28s trunk passed
          +1 checkstyle 1m 39s trunk passed
          +1 mvnsite 1m 24s trunk passed
          +1 mvneclipse 0m 41s trunk passed
          +1 findbugs 2m 26s trunk passed
          +1 javadoc 1m 9s trunk passed
          0 mvndep 0m 16s Maven dependency ordering for patch
          +1 mvninstall 1m 9s the patch passed
          +1 compile 11m 9s the patch passed
          +1 javac 11m 9s the patch passed
          -0 checkstyle 1m 43s root: The patch generated 26 new + 374 unchanged - 11 fixed = 400 total (was 385)
          +1 mvnsite 1m 29s the patch passed
          +1 mvneclipse 0m 44s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 2m 49s the patch passed
          -1 javadoc 0m 24s hadoop-tools_hadoop-distcp generated 3 new + 49 unchanged - 0 fixed = 52 total (was 49)
          -1 unit 70m 26s hadoop-hdfs in the patch failed.
          +1 unit 12m 1s hadoop-distcp in the patch passed.
          +1 asflicense 0m 37s The patch does not generate ASF License warnings.
          162m 37s



          Reason Tests
          Timed out junit tests org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-11794
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12850159/HADOOP-11794.003.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 33eac87c7a0f 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 87852b6
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11537/artifact/patchprocess/diff-checkstyle-root.txt
          javadoc https://builds.apache.org/job/PreCommit-HADOOP-Build/11537/artifact/patchprocess/diff-javadoc-javadoc-hadoop-tools_hadoop-distcp.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11537/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11537/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-distcp U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11537/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          mithun Mithun Radhakrishnan added a comment -

          Wow, this is really good work. (I'm continually astonished at how much DistCp has been improved upon and added to.)
          Please forgive me, my DistCp-ese is a little rusty. I have a couple of minor questions:

          1. In DistCpUtils::toCopyListingFileStatus(), the javadoc says it "Converts a list of FileStatus to a list CopyListingFileStatus". The method does not take a List<FileStatus>. Shall we remove "list of"?
2. Could we rephrase the doc to "Converts a `FileStatus` to a list of `CopyListingFileStatus`. Returns either one CopyListingFileStatus per chunk of file-blocks (if file-size exceeds chunk-size), or one CopyListingFileStatus for the entire file (if file-size is too small to split)."?
          3. DistCpUtils::toCopyListingFileStatus() handles heterogeneous block-sizes via DFSClient.getBlockLocations(), but only if fileStatus.getLen() > fileStatus.getBlockSize()*chunkSize. Is it possible for an HDFS file with fileStatus.getBlockSize() == 256M to be composed entirely of tiny blocks (say 32MB)? Could we have a situation where a splittable file (with small blocks) ends up unsplit, because fileStatus.getBlockSize() >> effectiveBlockSize?
          4. I wonder if chunksize might be confused to be the "chunk-length in bytes" (like CopyListingFileStatus.chunkLength). I could be wrong, but would blocksPerChunk be less ambiguous? (Please ignore if this is too pervasive.)
5. Nitpick: CopyListingFileStatus.toString() uses String concatenation inside a call to StringBuilder.append(). (It was that way well before this patch. :/) Shall we replace this with a chain of .append() calls? (A small example follows this list.)
          6. In CopyCommitter::concatFileChunks(), could we please add additional logging for what files/chunks are being merged?
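
For item 5, a small example of the chained-append idiom (the field values here are made up for illustration):

  // Illustrative: chain append() calls rather than concatenating inside append().
  long chunkOffset = 0L;
  long chunkLength = 1024L;
  StringBuilder sb = new StringBuilder();
  sb.append("{chunkOffset = ").append(chunkOffset)
    .append(", chunkLength = ").append(chunkLength)
    .append('}');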

          Thanks so much for working on this, Yongjun Zhang. :]

          yzhangal Yongjun Zhang added a comment - - edited

          Hi Mithun Radhakrishnan,

Thank you so much for the review and all the good comments!

          I just uploaded rev 004 to address all of them.

• To answer your question in 3: we avoid the extra RPC call that fetches all blocks of a file by ONLY making the call when the file size is bigger than blockSize * blocksPerChunk, and then checking whether the number of blocks is bigger than blocksPerChunk (see the sketch after this list). So it's possible that a file with many small blocks is not split. But I think that should be OK: the patch is intended to deal with really large files, and variable-size blocks are infrequent, so this check should be reasonably good. However, we could still improve it in the future if necessary.
• About 6: the logging is already done in the method mergeFileChunks, when debug logging is enabled.
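
A sketch of the guard described in the first bullet, with illustrative names (fs, fileStatus, blocksPerChunk); this shows the shape of the check, not the patch code itself:

  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileSystem;

  // Illustrative sketch: only pay for the block-locations RPC when the file is
  // large enough that it could possibly be split.
  if (fileStatus.getLen() > fileStatus.getBlockSize() * blocksPerChunk) {
    BlockLocation[] blocks =
        fs.getFileBlockLocations(fileStatus, 0, fileStatus.getLen());
    if (blocks.length > blocksPerChunk) {
      // split this file into chunks of <blocksPerChunk> blocks
    }
  }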

In addition, I also added one more condition to check whether the source FS is a DistributedFileSystem; otherwise, the file won't be split either.

          Wonder if you could take a look at the new patch.

          Thanks a lot.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 24s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
          0 mvndep 0m 13s Maven dependency ordering for branch
          +1 mvninstall 13m 46s trunk passed
          +1 compile 14m 4s trunk passed
          +1 checkstyle 1m 48s trunk passed
          +1 mvnsite 1m 36s trunk passed
          +1 mvneclipse 0m 40s trunk passed
          +1 findbugs 2m 34s trunk passed
          +1 javadoc 1m 10s trunk passed
          0 mvndep 0m 17s Maven dependency ordering for patch
          +1 mvninstall 1m 26s the patch passed
          +1 compile 12m 31s the patch passed
          +1 javac 12m 31s the patch passed
          -0 checkstyle 1m 51s root: The patch generated 27 new + 374 unchanged - 11 fixed = 401 total (was 385)
          +1 mvnsite 1m 32s the patch passed
          +1 mvneclipse 0m 46s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 3m 7s the patch passed
          -1 javadoc 0m 26s hadoop-distcp in the patch failed.
          -1 unit 73m 18s hadoop-hdfs in the patch failed.
          +1 unit 11m 57s hadoop-distcp in the patch passed.
          +1 asflicense 0m 38s The patch does not generate ASF License warnings.
          170m 13s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.blockmanagement.TestRBWBlockInvalidation
            hadoop.hdfs.TestMissingBlocksAlert
            hadoop.hdfs.server.datanode.checker.TestThrottledAsyncChecker



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-11794
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12850378/HADOOP-11794.004.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 270130d7026c 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / bec9b7a
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11552/artifact/patchprocess/diff-checkstyle-root.txt
          javadoc https://builds.apache.org/job/PreCommit-HADOOP-Build/11552/artifact/patchprocess/patch-javadoc-hadoop-tools_hadoop-distcp.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11552/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11552/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-distcp U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11552/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          yzhangal Yongjun Zhang added a comment -

          rev 005 to address checkstyle and javadoc issues.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 21s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
          0 mvndep 1m 56s Maven dependency ordering for branch
          +1 mvninstall 14m 53s trunk passed
          +1 compile 14m 57s trunk passed
          +1 checkstyle 1m 44s trunk passed
          +1 mvnsite 1m 31s trunk passed
          +1 mvneclipse 0m 48s trunk passed
          +1 findbugs 2m 43s trunk passed
          +1 javadoc 1m 14s trunk passed
          0 mvndep 0m 17s Maven dependency ordering for patch
          +1 mvninstall 1m 19s the patch passed
          +1 compile 15m 39s the patch passed
          +1 javac 15m 39s the patch passed
          -0 checkstyle 2m 9s root: The patch generated 1 new + 376 unchanged - 11 fixed = 377 total (was 387)
          +1 mvnsite 2m 0s the patch passed
          +1 mvneclipse 0m 51s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 3m 54s the patch passed
          +1 javadoc 0m 52s hadoop-hdfs in the patch passed.
          +1 javadoc 0m 23s hadoop-tools_hadoop-distcp generated 0 new + 48 unchanged - 1 fixed = 48 total (was 49)
          -1 unit 76m 6s hadoop-hdfs in the patch failed.
          +1 unit 12m 18s hadoop-distcp in the patch passed.
          +1 asflicense 0m 50s The patch does not generate ASF License warnings.
          181m 37s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.TestDecommissioningStatus
            hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-11794
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12850464/HADOOP-11794.005.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 69a1c5342d85 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / b6f290d
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11554/artifact/patchprocess/diff-checkstyle-root.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11554/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11554/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-distcp U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11554/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 13s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
          0 mvndep 0m 14s Maven dependency ordering for branch
          +1 mvninstall 12m 59s trunk passed
          +1 compile 17m 28s trunk passed
          +1 checkstyle 1m 42s trunk passed
          +1 mvnsite 1m 22s trunk passed
          +1 mvneclipse 0m 44s trunk passed
          +1 findbugs 2m 30s trunk passed
          +1 javadoc 1m 9s trunk passed
          0 mvndep 0m 17s Maven dependency ordering for patch
          +1 mvninstall 1m 21s the patch passed
          +1 compile 13m 17s the patch passed
          +1 javac 13m 17s the patch passed
          -0 checkstyle 1m 42s root: The patch generated 1 new + 375 unchanged - 11 fixed = 376 total (was 386)
          +1 mvnsite 1m 37s the patch passed
          +1 mvneclipse 0m 45s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 2m 48s the patch passed
          +1 javadoc 0m 51s hadoop-hdfs in the patch passed.
          +1 javadoc 0m 24s hadoop-tools_hadoop-distcp generated 0 new + 48 unchanged - 1 fixed = 48 total (was 49)
          -1 unit 71m 5s hadoop-hdfs in the patch failed.
          +1 unit 11m 47s hadoop-distcp in the patch passed.
          +1 asflicense 0m 42s The patch does not generate ASF License warnings.
          169m 47s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.namenode.TestStartup
            hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaRecovery



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-11794
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12850464/HADOOP-11794.005.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 19fc42a38ba4 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / b6f290d
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11555/artifact/patchprocess/diff-checkstyle-root.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11555/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11555/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-distcp U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11555/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          yzhangal Yongjun Zhang added a comment -

Rev 6 to address the checkstyle issue, plus some miscellaneous changes:

• Improved the exception message
• Reordered the condition checks when deciding whether to split a file, for better performance
• Added two more "__" separators to the chunk file names
          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 14s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
          0 mvndep 0m 21s Maven dependency ordering for branch
          +1 mvninstall 13m 12s trunk passed
          +1 compile 14m 25s trunk passed
          +1 checkstyle 1m 44s trunk passed
          +1 mvnsite 1m 34s trunk passed
          +1 mvneclipse 0m 44s trunk passed
          +1 findbugs 2m 41s trunk passed
          +1 javadoc 1m 10s trunk passed
          0 mvndep 0m 17s Maven dependency ordering for patch
          +1 mvninstall 1m 16s the patch passed
          +1 compile 12m 42s the patch passed
          +1 javac 12m 42s the patch passed
          +1 checkstyle 1m 54s root: The patch generated 0 new + 375 unchanged - 11 fixed = 375 total (was 386)
          +1 mvnsite 1m 42s the patch passed
          +1 mvneclipse 0m 45s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 3m 15s the patch passed
          +1 javadoc 0m 55s hadoop-hdfs in the patch passed.
          +1 javadoc 0m 24s hadoop-tools_hadoop-distcp generated 0 new + 48 unchanged - 1 fixed = 48 total (was 49)
          -1 unit 71m 44s hadoop-hdfs in the patch failed.
          +1 unit 12m 20s hadoop-distcp in the patch passed.
          +1 asflicense 0m 44s The patch does not generate ASF License warnings.
          169m 12s



          Reason Tests
          Failed junit tests hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-11794
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12850883/HADOOP-11794.006.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux b24de0651763 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / e023584
          Default Java 1.8.0_121
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11577/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11577/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-distcp U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11577/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          yzhangal Yongjun Zhang added a comment -

Rev 007:

1. Better -update handling: only copy files that need to be updated.
2. Better error handling: if the user enables ignore-errors, continue after an error.
3. Sanity checking in CopyCommitter, to ensure the chunk files of a given file are contiguous (a rough sketch follows below).
4. Corrected the file-comparison implementation in the unit test, to handle variable-size blocks in a file.
5. Added some additional unit tests.

Thanks.
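As a rough illustration of the kind of contiguity check item 3 describes, here is a minimal hand-written sketch (the ChunkInfo type and all names here are assumptions for illustration, not code from the patch):

  import java.io.IOException;
  import java.util.List;

  // Hypothetical sketch: verify that the chunk files of one source file
  // cover contiguous byte ranges, i.e. each chunk starts exactly where
  // the previous one ended.
  class ChunkContiguityCheck {
    static class ChunkInfo {
      final long offset;   // start of this chunk in the original file
      final long length;   // number of bytes in this chunk
      ChunkInfo(long offset, long length) {
        this.offset = offset;
        this.length = length;
      }
    }

    // Throws if the chunks (assumed sorted by offset) leave a gap.
    static void checkContiguous(List<ChunkInfo> chunks) throws IOException {
      long expected = 0;
      for (ChunkInfo c : chunks) {
        if (c.offset != expected) {
          throw new IOException("Chunk starts at offset " + c.offset
              + " but the previous chunk ended at " + expected);
        }
        expected += c.length;
      }
    }
  }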

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 19s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
          0 mvndep 0m 17s Maven dependency ordering for branch
          +1 mvninstall 14m 6s trunk passed
          +1 compile 13m 8s trunk passed
          +1 checkstyle 1m 41s trunk passed
          +1 mvnsite 1m 26s trunk passed
          +1 mvneclipse 0m 39s trunk passed
          +1 findbugs 2m 33s trunk passed
          +1 javadoc 1m 7s trunk passed
          0 mvndep 0m 18s Maven dependency ordering for patch
          +1 mvninstall 1m 21s the patch passed
          +1 compile 12m 1s the patch passed
          +1 javac 12m 1s the patch passed
          -0 checkstyle 1m 42s root: The patch generated 5 new + 374 unchanged - 12 fixed = 379 total (was 386)
          +1 mvnsite 1m 34s the patch passed
          +1 mvneclipse 0m 43s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 2m 47s the patch passed
          +1 javadoc 0m 51s hadoop-hdfs in the patch passed.
          +1 javadoc 0m 21s hadoop-tools_hadoop-distcp generated 0 new + 48 unchanged - 1 fixed = 48 total (was 49)
          -1 unit 77m 52s hadoop-hdfs in the patch failed.
          +1 unit 11m 59s hadoop-distcp in the patch passed.
          +1 asflicense 0m 45s The patch does not generate ASF License warnings.
          172m 27s



          Reason Tests
          Failed junit tests hadoop.hdfs.TestDFSRSDefault10x4StripedOutputStreamWithFailure
            hadoop.hdfs.TestDistributedFileSystem
            hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-11794
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12851564/HADOOP-11794.007.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux f551dea4e114 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 2007e0c
          Default Java 1.8.0_121
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11596/artifact/patchprocess/diff-checkstyle-root.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11596/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11596/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-distcp U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11596/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Lu Tao Lu Tao added a comment -

          very useful improvement!!

          yzhangal Yongjun Zhang added a comment -

          Thanks for the positive feedback Lu Tao!

Hi Mithun Radhakrishnan and Aaron T. Myers, would you please take a look at the latest patch to see whether all your comments have been addressed?

          Thanks.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
          0 mvndep 1m 49s Maven dependency ordering for branch
          +1 mvninstall 11m 57s trunk passed
          +1 compile 12m 29s trunk passed
          +1 checkstyle 1m 40s trunk passed
          +1 mvnsite 1m 11s trunk passed
          +1 mvneclipse 1m 23s trunk passed
          +1 findbugs 2m 16s trunk passed
          +1 javadoc 0m 57s trunk passed
          0 mvndep 1m 8s Maven dependency ordering for patch
          -1 mvninstall 0m 45s hadoop-hdfs in the patch failed.
          -1 mvninstall 0m 16s hadoop-distcp in the patch failed.
          +1 compile 10m 9s the patch passed
          +1 javac 10m 9s the patch passed
          +1 checkstyle 1m 49s the patch passed
          -1 mvnsite 0m 22s hadoop-distcp in the patch failed.
          +1 mvneclipse 0m 36s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          -1 findbugs 0m 18s hadoop-distcp in the patch failed.
          +1 javadoc 0m 46s hadoop-hdfs in the patch passed.
          +1 javadoc 0m 17s hadoop-tools_hadoop-distcp generated 0 new + 48 unchanged - 1 fixed = 48 total (was 49)
          -1 unit 64m 6s hadoop-hdfs in the patch failed.
          -1 unit 0m 25s hadoop-distcp in the patch failed.
          +1 asflicense 0m 30s The patch does not generate ASF License warnings.
          143m 6s



          Reason Tests
          Failed junit tests hadoop.hdfs.web.TestWebHdfsTimeouts



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-11794
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12851564/HADOOP-11794.007.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 92355ce8312a 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 694e680
          Default Java 1.8.0_121
          findbugs v3.0.0
          mvninstall https://builds.apache.org/job/PreCommit-HADOOP-Build/11702/artifact/patchprocess/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt
          mvninstall https://builds.apache.org/job/PreCommit-HADOOP-Build/11702/artifact/patchprocess/patch-mvninstall-hadoop-tools_hadoop-distcp.txt
          mvnsite https://builds.apache.org/job/PreCommit-HADOOP-Build/11702/artifact/patchprocess/patch-mvnsite-hadoop-tools_hadoop-distcp.txt
          findbugs https://builds.apache.org/job/PreCommit-HADOOP-Build/11702/artifact/patchprocess/patch-findbugs-hadoop-tools_hadoop-distcp.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11702/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11702/artifact/patchprocess/patch-unit-hadoop-tools_hadoop-distcp.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11702/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-distcp U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11702/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          yzhangal Yongjun Zhang added a comment -

Looks like something changed on the build side: the same rev 7 patch was almost clean in the last run (except for checkstyle and a few unrelated test failures), but it hit the different errors above.

Uploaded new rev 8, which has a fix for totalBytesToCopy.

          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 16s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
          0 mvndep 0m 14s Maven dependency ordering for branch
          +1 mvninstall 12m 26s trunk passed
          +1 compile 13m 35s trunk passed
          +1 checkstyle 1m 52s trunk passed
          +1 mvnsite 1m 20s trunk passed
          +1 mvneclipse 0m 39s trunk passed
          +1 findbugs 2m 23s trunk passed
          +1 javadoc 1m 7s trunk passed
          0 mvndep 0m 16s Maven dependency ordering for patch
          +1 mvninstall 1m 7s the patch passed
          +1 compile 11m 19s the patch passed
          +1 javac 11m 19s the patch passed
          +1 checkstyle 1m 57s the patch passed
          +1 mvnsite 1m 27s the patch passed
          +1 mvneclipse 0m 44s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 2m 42s the patch passed
          +1 javadoc 0m 50s hadoop-hdfs in the patch passed.
          +1 javadoc 0m 24s hadoop-tools_hadoop-distcp generated 0 new + 48 unchanged - 1 fixed = 48 total (was 49)
          +1 unit 64m 16s hadoop-hdfs in the patch passed.
          +1 unit 11m 21s hadoop-distcp in the patch passed.
          +1 asflicense 0m 39s The patch does not generate ASF License warnings.
          156m 6s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-11794
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12854375/HADOOP-11794.008.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 8341f7763e65 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 132f758
          Default Java 1.8.0_121
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11706/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-distcp U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11706/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          zshao Zheng Shao added a comment -

          Yongjun Zhang Can you provide a github link to make review easier? I am very interested.

          yzhangal Yongjun Zhang added a comment -

          Thanks for being interested Zheng Shao,

Wonder if you would consider trying the patch viewer https://chrome.google.com/webstore/detail/git-patch-viewer/hkoggakcdopbgnaeeidcmopfekipkleg?hl=en-US ? It works nicely in the browser: you click on the patch and it shows the changes made by the patch side by side.

When the patch viewer is not sufficient, I usually review a patch by applying it and running "git difftool" with a graphical viewer. Most folks upload patches to jira and view them this way.

          Hope that helps,

          stevel@apache.org Steve Loughran added a comment -
          1. this is an opportunity to switch distcp over to using the slf4j logger class; existing logging can be left alone, but all new logs can switch to the inline logging
          2. What does "YJD ls before distcp" in tests mean?
3. TestDistCpSystem does a cleanup in testDistcpLargeFile as the last operation in a successful test run. Does it still clean up on a failure? If not, what is the final state of the call, and does it matter?
          4. in the s3a tests we now have a -Pscale profile for scalable tests, and can set file sizes. It might be nice to have here, but it's a complex piece of work: not really justifiable except as a bigger set of scale tests
          omkarksa Omkar Aradhya K S added a comment - - edited

I was trying to evaluate your patch with ADLS:
Tried the bits on an HDInsight 3.5 cluster (this comes with Hadoop 2.7)
Observed the following compatibility issues:
a. You are checking for an instance of DistributedFileSystem in many places, and other FileSystem implementations don't implement DistributedFileSystem
i. Could this be changed to something more compatible with other FileSystem implementations?
b. You are using the new DFSUtilClient, which makes DistCp incompatible with older versions of Hadoop
i. Can this be changed to be backward compatible?
If the compatibility issues are addressed, DistCp with your feature would be available for other FileSystem implementations and would also be backward compatible.

          yzhangal Yongjun Zhang added a comment -

          Thanks much for reviewing and trying Steve Loughran and Omkar Aradhya K S!

          this is an opportunity to switch distcp over to using the slf4j logger class; existing logging can be left alone, but all new logs can switch to the inline logging

Since this jira has been going on for a long time, I hope we can address the logger issue in a separate follow-up jira.

          What does "YJD ls before distcp" in tests mean?

Good catch, I forgot to drop some debugging code in the tests; will do in the next rev.

          Does it still cleanup on a failure? If not, what is the final state of the call & does it matter

          It does not really matter since the test failed, but cleaning it up would be ok too.

          in the s3a tests we now have a -Pscale profile for scalable tests, and can set file sizes. It might be nice to have here, but it's a complex piece of work: not really justifiable except as a bigger set of scale tests

Scale testing is a good thing to do; the unit tests in the patch mostly focus on functionality.

          5. Observed following compatibility issues:
          a. You are checking for instance of DistributedFileSystem in many places and all other FileSystem implementations don’t implement DistributedFileSystem
          i. Could this be changed to something more compatible with other implementations of FileSystem?

The main reason for checking DistributedFileSystem is its support for getBlockLocations and the concat feature. I'm not sure whether we can assume other FileSystem implementations support those.

          b. You are using the new DFSUtilClient, which makes DistCp incompatible with older versions of Hadoop
          i. Can this be changed to be backward compatible

The current patch is for trunk, where client and server code are separated. When we backport this change to other versions of Hadoop, we can adjust accordingly, for example, to use DFSUtil.

6. If the compatibility issues are addressed, the new DistCp with your feature would be available for other FileSystem implementations, as well as backward compatible.
a. I was able to make a few small modifications to your patch and get it working with ADLS.

Good work there! Glad to hear that it works for you with only small modifications. I think we can probably commit this patch first, and then do the other work as improvement jiras.

          Thanks again!

          omkarksa Omkar Aradhya K S added a comment -

          The main reason of checking DistributedFileSystem is the support of getBlockLocations, and concat feature. I'm not sure whether we can assume other File System support that.

getFileBlockLocations and concat are APIs that have been part of FileSystem.java since Hadoop v1.2.1.

          The current patch is for trunk where client and server code are separated. When we backport this change to other version of hadoop, we can make the change accordingly, for example, to use DFSUtil.

You could just use the constructor that takes only a Configuration, which internally resolves the NameNode address:

          final DFSClient dfs = new DFSClient(conf);
          
          yzhangal Yongjun Zhang added a comment -

          Thanks Omkar Aradhya K S.

I chose the other method because the following one is marked as deprecated. Not sure why; it seems like a convenient API to have.

            /**
             * Same as this(NameNode.getNNAddress(conf), conf);
             * @see #DFSClient(InetSocketAddress, Configuration)
             * @deprecated Deprecated at 0.21
             */
            @Deprecated
            public DFSClient(Configuration conf) throws IOException {
              this(DFSUtilClient.getNNAddress(conf), conf);
            }
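For reference, the non-deprecated equivalent is presumably the two-argument form the @see tag points at, spelled out explicitly (a sketch that simply mirrors the deprecated constructor's body):

  // Equivalent to the deprecated DFSClient(conf), written explicitly.
  final DFSClient dfs =
      new DFSClient(DFSUtilClient.getNNAddress(conf), conf);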
          

About other file systems: I did not get to test various file systems with this patch (except for DistributedFileSystem). Could we follow up with a new jira to relax the file system requirement, and add tests for each corresponding file system?

          Hi Aaron T. Myers, I will address the most recent comments since you reviewed last time. Would you please take a look at rev8 to see if you have additional comments?

          Thanks a lot.

          omkarksa Omkar Aradhya K S added a comment -

          Thanks for the clarification Yongjun Zhang.
          About this ...

          About other file systems, I did not get to test various file systems with this patch (except for DistributedFileSystem), we could follow-up with new jira to relax the file system requirement, and add corresponding tests for the corresponding file system?

Is there any reason not to use FileSystem.concat and FileSystem.getFileBlockLocations?
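For reference, the filesystem-agnostic calls being proposed would look roughly like this (a minimal sketch with made-up paths; it assumes the destination FS actually implements concat, which is the crux of the discussion that follows):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class FsAgnosticSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path src = new Path("/tmp/source-file");          // made-up path
      FileSystem fs = src.getFileSystem(conf);

      // Block layout of the source, via the generic FileSystem API.
      FileStatus stat = fs.getFileStatus(src);
      BlockLocation[] blocks =
          fs.getFileBlockLocations(stat, 0, stat.getLen());
      System.out.println(blocks.length + " block location(s)");

      // Reassemble chunk files onto a target, via the generic API.
      // The base implementation throws UnsupportedOperationException
      // on filesystems that do not support concat.
      Path target = new Path("/tmp/target-file");       // made-up path
      Path[] chunks = { new Path("/tmp/chunk0"), new Path("/tmp/chunk1") };
      fs.concat(target, chunks);
    }
  }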

          stevel@apache.org Steve Loughran added a comment -

          Is there any reason not to use FileSystem.concat & FileSystem.getFileBlockLocations ?

FileSystem.getFileBlockLocations is something filesystems have to implement, otherwise basic client code fails; if they don't have locality they tend to just say "localhost" and one block.

Concat, though, is barely implemented. As the FS spec says, "This is a little-used operation currently implemented only by HDFS".

Looking for subclasses implementing FileSystem.concat(), it looks like only hdfs, webhdfs, httpfs, and filterfilesystem do it.

Supporting webhdfs would be really good, as it's the one recommended for cross-hadoop-version distcp, and for long-haul copies.

For now, how about having it check for HDFS and webhdfs, and reject anything else?

          stevel@apache.org Steve Loughran added a comment -

One of the problems we have here is that there's no API for implementations to declare what they do; HADOOP-9565 discussed this, but it has stalled. As it is, there's no way to determine whether an FS implements a feature unless it is probed.

There is always the option of doing exactly that: sending in an invalid concat() request and differentiating between UnsupportedException and any other response, then assuming that an "any other response" exception means the operation is implemented but the arguments were invalid. concat("/", new Path[0]) should be enough.
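A rough sketch of that probe (illustrative only, not code from the patch; note that the base FileSystem.concat throws UnsupportedOperationException, so any other exception is taken to mean a real implementation rejected the bogus arguments):

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  class ConcatProbe {
    // Probe whether an FS implements concat by issuing a deliberately
    // invalid request, as described above.
    static boolean supportsConcat(FileSystem fs) {
      try {
        fs.concat(new Path("/"), new Path[0]);  // invalid on purpose
        return true;                            // unlikely, but implemented
      } catch (UnsupportedOperationException e) {
        return false;                           // operation not implemented
      } catch (Exception e) {
        return true;    // implemented, but the bogus arguments were refused
      }
    }
  }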

Omkar, are you planning to do a new concat? Because it might be that for different filesystems, there are better things to do.
For S3, we could attempt to do multipart PUT operations in parallel, though that would be somewhat complicated by the fact that you need to know the request ID before any part of the operation begins. If you were doing the upload from a single machine, the block output stream writes data in blocks now anyway.

I don't know about other object stores, but we can and should think about how best to support them, even assuming their parallel upload mechanisms are similarly unique. It may be that the code needs to be reworked to support different partitioning & scheduling for different endpoints. FWIW, I've been contemplating what it would take to do one in Spark, because that might let me start the upload before even the listing has finished, and reschedule work where there is capacity, rather than deciding up front how to break things up. Supporting parallelised chunk upload wasn't something I'd considered, though. An extra complication.

          jzhuge John Zhuge added a comment -

Steve Loughran, concat is implemented by the ADLS backend as a constant-time operation.

          omkarksa Omkar Aradhya K S added a comment -

          Thanks Steve Loughran, John Zhuge,

          Yes, this could be one way to do it. Let me see if there is any other way.

          There is always the option of doing that: sending in an invalid concat() request and differentiating between: UnsupportedException and any other response, then assuming that the "any other response" exception means that it is implemented, but that the arguments were invalid. concat("/", new Path[0]) should be enough.

That's right, ADLS supports both concat and getFileBlockLocations, and as commented earlier in this JIRA, I was able to get this patch working with ADLS with these changes plus one more.

          stevel@apache.org Steve Loughran added a comment -

I'm less worried about GFBL (getFileBlockLocations): everything at least makes something up there. If concat works in ADLS then it should be supported.

Given this feature only turns on if blocksperchunk is set, maybe we should just say "if you enable that, your destination had better support concat". That could be added to the --help usage. The concat call can be wrapped in a handler that catches UnsupportedException and tells the caller to stop.

The test for this should go alongside/inside AbstractContractDistCpTest, so that it can be run against all filesystems which support concat.

          yzhangal Yongjun Zhang added a comment -

          Hi Omkar Aradhya K S, Steve Loughran John Zhuge,

          Thanks for the good discussion here. All are very good comments!

I personally prefer a step-by-step approach; this jira can be a foundation on which to add support and improvements, such as webhdfs and ADLS. It's easier to manage that way. The new jiras would include not only the code changes (even if minimal), but also unit tests.

Omkar, what about creating a follow-up jira for ADLS, and working on it together after this jira is in?

Steve, I will also create a follow-up jira for webhdfs.

          Thanks.

          stevel@apache.org Steve Loughran added a comment -

The thing is: if we remove the checks for FS type, we don't increase code complexity; it's simplified, one less check. Just have the exception handler log something.

          chris.douglas Chris Douglas added a comment -

          I agree with Steve Loughran. If the FileSystem doesn't support concat, then allowing the job to fail is a reasonable "foundation".

          yzhangal Yongjun Zhang added a comment -

          Hi Steve Loughran, Chris Douglas,

          Thanks for the feedback.

I think if we know the job will fail, we want it to fail sooner rather than later. That's why I put the DistributedFileSystem check at the beginning of distcp.
Imagine if we run the job halfway and then find it doesn't work: we have not only wasted computing power, but also possibly left the cluster in an inconsistent state.

So I think we should check whether the file systems support getBlockLocations and concat at the very beginning of distcp. That, in my opinion, can be done as a follow-up jira, because supporting a different filesystem involves not only the check here, but also new unit tests for the corresponding file systems, and system-level testing. Does this make sense to you?

          Thanks.

          chris.douglas Chris Douglas added a comment -

          Omkar Aradhya K S, can you post your patch?

I see your point, Yongjun Zhang, but shouldn't the cleanup/rollback code handle the inconsistency? Moreover, doesn't distcp also use append to support sync, without first verifying that the destination FS supports it? Wasted cycles are unlikely: this doesn't fail intermittently, it fails 100% of the time for unambiguous reasons. Surely someone would test this option before trying it on a significant deployment.

To fail before submission, this could use concat during job setup if enabled, e.g., parallelize the scan and concatenate the result for the input file [1]. More generally, distcp could add a phase to job setup that verifies the options are consistent with the capabilities of the src/dst FileSystems, but that would be an extension.

The early check makes sense, but false positives are worse than false negatives here.

[1] Unfortunately, SequenceFile (the format distcp uses) would need some modifications to make that straightforward. There's an option to omit the header if the file already exists, but not one that explicitly and independently suppresses it.

          yzhangal Yongjun Zhang added a comment -

Had a discussion with Aaron T. Myers (thanks ATM), and we now agree on an approach:

• relax the file system checking
• add the doc, as Steve Loughran commented, stating that concat needs to be supported when the feature is turned on
• create follow-up jiras to add unit tests and validation for ADLS and other file systems
• for file systems that don't support getBlockLocations or concat, create jiras to make them fail the run sooner.

Please be aware of the situation I pointed out in my last comment: if a user enables this feature for file systems that don't support concat, distcp may fail in the middle of a run and leave the file system in an inconsistent state.

          Does that sound good to all?

          Thanks.

          yzhangal Yongjun Zhang added a comment -

Hi Chris Douglas,

Sorry, I did not see your last post before making my previous comment, and I had to be offline for some time.

In a case where the source supports getBlockLocations and the target doesn't support concat, the current patch would split a file and copy it into chunk files at the target; then at the commit stage we would find that concat doesn't work, so the target is polluted (an inconsistent state). At that point, distcp may have been running for a very long time. To remedy that, I will add a "concat" check at the same place where I check for DistributedFileSystem, and catch UnsupportedOperationException (as Steve Loughran suggested). This will address the last item in my previous comment.

          Thanks.

          chris.douglas Chris Douglas added a comment -

          the current patch would split a file and copy them into chunk files at target, then at commit stage we will find concat doesn't work, thus the target is polluted (inconsistent state)

Doesn't this imply that FileSystems that support concat can also be left in an inconsistent state? If the concat operation fails, the job is killed/fails/dies, etc., then distcp cleanup should remove the partial work. If a FileSystem doesn't support concat, shouldn't that failure follow the same path?

          Either way, +1 on the approach you outline. Thanks for considering the feedback and updating the patch, Yongjun Zhang. Looking forward to this enhancement; it's been a long time coming!

          yzhangal Yongjun Zhang added a comment -

          Thanks Chris Douglas.

Uploaded rev9 as outlined. Would all reviewers please take a look? Thanks a lot!

          Would appreciate if you could help testing ADLS with this version, Omkar Aradhya K S!

          With this patch:

• removed the DistributedFileSystem check
• to enable the feature, the source FS needs to implement getBlockLocations and the target FS needs to implement concat
• concat support is checked at the beginning of distcp; an exception is thrown when -blocksperchunk is passed and concat is not supported (see the sketch below)
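
For illustration, a minimal sketch of what such an early check could look like; the class, method, and message here are invented, not taken from the patch. It relies on the base FileSystem#concat throwing UnsupportedOperationException:

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  class ConcatSupportCheck {
    // Hypothetical helper: probe the target FS for concat support up front,
    // so a -blocksperchunk run fails before any chunk is copied.
    static void checkConcatSupport(FileSystem targetFS) {
      try {
        // The default FileSystem#concat throws UnsupportedOperationException;
        // an FS that overrides it will instead fail on the bogus probe
        // arguments, and that failure is deliberately ignored below.
        targetFS.concat(null, new Path[0]);
      } catch (UnsupportedOperationException e) {
        throw new UnsupportedOperationException(
            "-blocksperchunk requires a target FileSystem that supports concat",
            e);
      } catch (Exception ignored) {
        // concat is implemented; the probe arguments just weren't valid.
      }
    }
  }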

          Doesn't this imply that FileSystems that support concat can also be left in an inconsistent state? If the concat operation fails, the job is killed/fails/dies, etc. then distcp cleanup should remove partial work. If a FileSystem doesn't support concat, shouldn't that failure follow the same path?

Yes, indeed. However, if we run the same job again successfully, it will clean up the temporary chunk files. We could clean up the chunk files when concat fails if we really wanted to. However, given that concat is supported, if concat fails we need to know why, and keeping the chunk files helps debugging. If the files are good, we can potentially concat them manually. If distcp fails in the middle for some other reason, the source and target will differ anyway.

          Thanks.

          omkarksa Omkar Aradhya K S added a comment -

Yongjun Zhang Thanks for reconsidering the comments and making the required changes for FileSystem compatibility.
          Steve Loughran, John Zhuge, Aaron T. Myers, Chris Douglas Thanks for providing clarity and a way ahead.

          Omkar Aradhya K S, can you post your patch?

I made only the following changes in my patch to get it working with ADLS on hadoop 2.7 (a rough sketch of the resulting calls follows the list):

1. Remove all the checks for DistributedFileSystem
2. Use `fs.concat` instead of `dstdistfs.concat`
3. Use `fs.getFileBlockLocations` instead of `dfs.getBlockLocations`
4. Use (now deprecated) `final DFSClient dfs = new DFSClient(conf);` instead of `final DFSClient dfs = new DFSClient(DFSUtilClient.getNNAddress(conf), conf);`
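
Roughly, the FileSystem-level replacements in items 2 and 3 look like this; a sketch only, with illustrative names, using the generic FileSystem API rather than anything lifted from the patch:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  class FsAgnosticCalls {
    // Generic replacement for DFSClient#getBlockLocations: any FileSystem
    // that reports block locations can act as a splittable distcp source.
    static BlockLocation[] locateBlocks(Configuration conf, Path src)
        throws IOException {
      FileSystem fs = src.getFileSystem(conf);
      FileStatus stat = fs.getFileStatus(src);
      return fs.getFileBlockLocations(stat, 0, stat.getLen());
    }

    // Generic replacement for DistributedFileSystem#concat: any FileSystem
    // that implements concat (e.g. ADLS) can reassemble the chunk files.
    static void reassemble(Configuration conf, Path target, Path[] chunks)
        throws IOException {
      target.getFileSystem(conf).concat(target, chunks);
    }
  }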

Yongjun Zhang Once the new patch with all the above changes is checked in, we need to backport it to older versions of hadoop; will that be addressed by new JIRAs?

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 16s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
          0 mvndep 0m 14s Maven dependency ordering for branch
          +1 mvninstall 13m 18s trunk passed
          +1 compile 22m 14s trunk passed
          +1 checkstyle 1m 58s trunk passed
          +1 mvnsite 1m 23s trunk passed
          +1 mvneclipse 0m 43s trunk passed
          +1 findbugs 2m 22s trunk passed
          +1 javadoc 1m 10s trunk passed
          0 mvndep 0m 17s Maven dependency ordering for patch
          +1 mvninstall 1m 7s the patch passed
          +1 compile 15m 49s the patch passed
          -1 javac 15m 49s root generated 1 new + 777 unchanged - 0 fixed = 778 total (was 777)
          -0 checkstyle 2m 1s root: The patch generated 5 new + 377 unchanged - 12 fixed = 382 total (was 389)
          +1 mvnsite 1m 30s the patch passed
          +1 mvneclipse 0m 46s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 2m 49s the patch passed
          +1 javadoc 0m 52s hadoop-hdfs in the patch passed.
          +1 javadoc 0m 25s hadoop-tools_hadoop-distcp generated 0 new + 48 unchanged - 1 fixed = 48 total (was 49)
          +1 unit 63m 52s hadoop-hdfs in the patch passed.
          +1 unit 12m 50s hadoop-distcp in the patch passed.
          +1 asflicense 0m 41s The patch does not generate ASF License warnings.
          171m 55s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-11794
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12860084/HADOOP-11794.009.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 314982f8157f 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 59d6925
          Default Java 1.8.0_121
          findbugs v3.0.0
          javac https://builds.apache.org/job/PreCommit-HADOOP-Build/11893/artifact/patchprocess/diff-compile-javac-root.txt
          checkstyle https://builds.apache.org/job/PreCommit-HADOOP-Build/11893/artifact/patchprocess/diff-checkstyle-root.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11893/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-distcp U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11893/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          stevel@apache.org Steve Loughran added a comment -

I like the direction here. I understand why a fail-fast is good: otherwise it's the workers failing, plus the problem of reporting. That's something to consider later. For now: the caller had better know about the destination.

Looking forward to seeing this in, and thanks to everyone engaged in testing it.

          Omkar: if ADL doesn't implement the distcp contract test, you might want to follow up this patch with a distcp test that forces the use of the concat operation.
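
For reference, such a test might look roughly like the sketch below. It is purely illustrative: the class name and paths are invented, and it assumes DistCp's (Configuration, DistCpOptions) constructor and ToolRunner; a real test would first create a multi-block source file (e.g. on a MiniDFSCluster or against an ADL test account).

  import static org.junit.Assert.assertEquals;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.tools.DistCp;
  import org.apache.hadoop.util.ToolRunner;
  import org.junit.Test;

  public class TestDistCpForcesConcat {
    @Test
    public void testCopyWithChunking() throws Exception {
      Configuration conf = new Configuration();
      // Invented locations; the source must span several blocks so that
      // -blocksperchunk 1 produces multiple chunks per file.
      Path src = new Path("/test/src/multi-block-file");
      Path dst = new Path("/test/dst");
      // -blocksperchunk 1 puts every block in its own chunk, so the commit
      // phase must call FileSystem#concat to reassemble the target file.
      String[] args = {"-blocksperchunk", "1", src.toString(), dst.toString()};
      // Passing null options defers parsing to DistCp#run, which ToolRunner
      // invokes with the args above.
      assertEquals(0, ToolRunner.run(conf, new DistCp(conf, null), args));
    }
  }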

          yzhangal Yongjun Zhang added a comment -

          Thanks much all!

          Omkar Aradhya K S,

All but one of the changes you need were already in rev9. I will add the remaining one in the next rev.

          All:
          About

            /**
             * Same as this(NameNode.getNNAddress(conf), conf);
             * @see #DFSClient(InetSocketAddress, Configuration)
             * @deprecated Deprecated at 0.21
             */
            @Deprecated
            public DFSClient(Configuration conf) throws IOException {
              this(DFSUtilClient.getNNAddress(conf), conf);
            }
          

I am not sure why this API is marked deprecated; it seems a convenient API to have. I wonder if we should remove the deprecation?

          Thanks.

          yzhangal Yongjun Zhang added a comment -

          Uploaded rev10 to address remaining stuff.

About the API mentioned in my last comment, it turned out we don't need to use it, so let's leave it as is.

          Thanks.

          chris.douglas Chris Douglas added a comment -

          Once the new patch with all above changes is checked in, we need to back port it to older versions of hadoop, which will be addressed by new JIRAs?

          This is usually handled in the same ticket, and by cherry-picking the patch. Backporting doesn't usually warrant a new JIRA unless the implementation is significantly different.

          I am not sure why this API is marked deprecated, it seems a good convenient API to exist. Wonder if we should remove the deprecation?

          If it was deprecated in 0.21, evidently we're not serious about removing it.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 4 new or modified test files.
          0 mvndep 0m 18s Maven dependency ordering for branch
          +1 mvninstall 15m 45s trunk passed
          +1 compile 22m 58s trunk passed
          +1 checkstyle 2m 7s trunk passed
          +1 mvnsite 1m 30s trunk passed
          +1 mvneclipse 0m 46s trunk passed
          +1 findbugs 2m 38s trunk passed
          +1 javadoc 1m 12s trunk passed
          0 mvndep 0m 18s Maven dependency ordering for patch
          +1 mvninstall 1m 13s the patch passed
          +1 compile 17m 15s the patch passed
          +1 javac 17m 15s the patch passed
          +1 checkstyle 2m 23s root: The patch generated 0 new + 377 unchanged - 13 fixed = 377 total (was 390)
          +1 mvnsite 1m 51s the patch passed
          +1 mvneclipse 0m 50s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 3m 12s the patch passed
          +1 javadoc 0m 56s hadoop-hdfs in the patch passed.
          +1 javadoc 0m 27s hadoop-tools_hadoop-distcp generated 0 new + 48 unchanged - 1 fixed = 48 total (was 49)
          -1 unit 76m 13s hadoop-hdfs in the patch failed.
          +1 unit 13m 46s hadoop-distcp in the patch passed.
          +1 asflicense 0m 44s The patch does not generate ASF License warnings.
          192m 12s



          Reason Tests
          Failed junit tests hadoop.hdfs.TestDFSStripedOutputStreamWithFailure000
            hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue HADOOP-11794
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12860188/HADOOP-11794.010.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 5de3607cd1d0 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 595f62e
          Default Java 1.8.0_121
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HADOOP-Build/11898/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HADOOP-Build/11898/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs hadoop-tools/hadoop-distcp U: .
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/11898/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          yzhangal Yongjun Zhang added a comment -

The failed tests succeeded in a local run. They look flaky and are not relevant to the change here.

          omkarksa Omkar Aradhya K S added a comment -

          This is usually handled in the same ticket, and by cherry-picking the patch. Backporting doesn't usually warrant a new JIRA unless the implementation is significantly different.

          Chris Douglas, Yongjun Zhang Thanks for all the clarifications.

          chris.douglas Chris Douglas added a comment -

          +1 overall, though the DistCp docs currently claim:

          Both the source and the target FileSystem must be DistributedFileSystem
          

          With the relaxed check, this could read "The target FileSystem must support the FileSystem#concat operation".

          Omkar Aradhya K S, you may want to verify the latest patch works for ADLS.

          yzhangal Yongjun Zhang added a comment -

Thanks Chris Douglas, the doc was already addressed in rev10. Maybe you were looking at the -rdiff section, which is not relevant to the patch here? Thanks.

          yzhangal Yongjun Zhang added a comment -

          Hi Omkar Aradhya K S,

I'd really appreciate it if you could try out rev10 and confirm that it works with ADLS.

          Thanks a lot.

          chris.douglas Chris Douglas added a comment -

          Maybe you were looking at the -rdiff section which is not relevant to the patch here?

          You're right. Sorry, I misread the diff.

          yzhangal Yongjun Zhang added a comment -

          No problem and thanks for confirming Chris Douglas.

          omkarksa Omkar Aradhya K S added a comment -

          Hi Yongjun,

I am on vacation till tomorrow.
Would it be too late if I review it tomorrow?
If you are held up because of me, I can get home and try it today. Please let me know.

          Regards,
          Omkar

          yzhangal Yongjun Zhang added a comment -

          No problem to test it out after you are back from vacation Omkar Aradhya K S, many thanks!

          omkarksa Omkar Aradhya K S added a comment -

          Yongjun Zhang I have tested the patch with ADLS and it works without any changes. Thanks.

          yzhangal Yongjun Zhang added a comment -

          Thank you so much and great to hear Omkar Aradhya K S!

Hi Chris Douglas, Aaron T. Myers, your recent reviews indicate a very close +1; would you please take a look at the latest rev10 to see if it all looks good? If so, we can get the patch in and follow up with additional work if necessary. Other folks are welcome to review too. Thanks!

          chris.douglas Chris Douglas added a comment -

          I skimmed the patch, looks reasonable. +1

          Thanks for all the followup, Yongjun Zhang.

          yzhangal Yongjun Zhang added a comment -

Thanks a lot Chris Douglas! Will commit soon.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11505 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11505/)
          HADOOP-11794. Enable distcp to copy blocks in parallel. Contributed by (yzhang: rev 064c8b25eca9bc825dc07a54d9147d65c9290a03)

          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyCommitter.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/CopyListingFileStatus.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/CopyListing.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/UniformSizeInputFormat.java
          • (edit) hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptionSwitch.java
          • (edit) hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/DFSTestUtil.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
          • (edit) hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSystem.java
          • (edit) hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java
          yzhangal Yongjun Zhang added a comment -

I just committed to trunk. Will work on the branch-2 version asap (tried it and saw quite a few conflicts).

          Many thanks to many people! dhruba borthakur for reporting the issue, Rosie Li for the very initial patch (MAPREDUCE-2257), Wei-Chiu Chuang and Xiao Chen for the assistance when I worked on the initial patch of HADOOP-11794, Mithun Radhakrishnan, Aaron T. Myers, Steve Loughran, Chris Douglas for the review, Omkar Aradhya K S for reviewing and testing with ADLS, Andrew Wang and John Zhuge for the discussion!

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11506 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11506/)
          Revert "HADOOP-11794. Enable distcp to copy blocks in parallel. (yzhang: rev 144f1cf76527e6c75aec77ef683a898580f3cc8d)

          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptionSwitch.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/CopyListing.java
          • (edit) hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
          • (edit) hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
          • (edit) hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSystem.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/CopyListingFileStatus.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/DFSTestUtil.java
          • (edit) hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/UniformSizeInputFormat.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyCommitter.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java
            HADOOP-11794. Enable distcp to copy blocks in parallel. Contributed by (yzhang: rev bf3fb585aaf2b179836e139c041fc87920a3c886)
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
          • (edit) hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCp.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
          • (edit) hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestDistCpSystem.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/DFSTestUtil.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyCommitter.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyMapper.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/CopyListing.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/CopyListingFileStatus.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptionSwitch.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/OptionsParser.java
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java
          • (edit) hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/TestOptionsParser.java
          • (edit) hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm
          • (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/UniformSizeInputFormat.java
          omkarksa Omkar Aradhya K S added a comment -

          Yongjun Zhang Thanks for re-considering the suggestions and re-doing the patch to accommodate all FileSystem implementations.

          I just committed to trunk. Will work on branch-2 version asap (tried and see quite some conflicts).

          Could you please elaborate on how you plan to proceed with backporting?

          yzhangal Yongjun Zhang added a comment -

          Welcome Omkar Aradhya K S, I will be working on backporting to other branches asap. Do you have specific expectations?

Just to clarify, I did not mean that we shouldn't consider supporting other file systems; rather, I was suggesting working on that as a separate jira. Your help testing ADLS, together with Steve Loughran's suggestion about checking concat support (UnsupportedOperationException), made it easier for us to relax the file system constraint in this jira. So thank you guys again!

          BTW, Steve still has an item for you to follow-up here

          https://issues.apache.org/jira/browse/HADOOP-11794?focusedCommentId=15938217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15938217

          Thanks.

          omkarksa Omkar Aradhya K S added a comment -

          BTW, Steve still has an item for you to follow-up here

          https://issues.apache.org/jira/browse/HADOOP-11794?focusedCommentId=15938217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15938217

          Yongjun Zhang Sorry for the late reply. Thanks for pointing this out. I almost missed this!

          Omkar: if ADL doesn't implement the distcp contract test, you might want to follow up this patch with a distcp test that forces the use of the concat operation.

          Steve Loughran I will look into this.

          yzhangal Yongjun Zhang added a comment -

          Welcome and thanks Omkar Aradhya K S.

BTW, I have a branch-2 version of HADOOP-11794, for which I backported HADOOP-13626 to branch-2 to make things cleaner. Hi Chris Douglas, do we have any concern about not having HADOOP-13626 in branch-2 earlier? If not, I will commit it (it's a clean one), then post the branch-2 version here.

          Thanks.

          yzhangal Yongjun Zhang added a comment -

Hi Chris Douglas,

Thanks for confirming that HADOOP-13626 can be put into branch-2; I have just committed it. I have now uploaded the branch-2 patch for this jira here, with some misc conflicts resolved. Would you please take a look?

Hi Omkar Aradhya K S, I wonder if you could help run this branch-2 patch on ADLS too, if possible?

          Thanks a lot!

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 0s Docker mode activated.
          -1 patch 0m 8s HADOOP-11794 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help.



          Subsystem Report/Notes
          JIRA Issue HADOOP-11794
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12862192/HADOOP-11794.010.branch2.patch
          Console output https://builds.apache.org/job/PreCommit-HADOOP-Build/12035/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          omkarksa Omkar Aradhya K S added a comment -

          Hi Omkar Aradhya K S, wonder if you could help run this branch-2 patch on ADLS too if possible?

          Yongjun Zhang Sure, I will finish testing this by early next week.

          omkarksa Omkar Aradhya K S added a comment -

          Yongjun Zhang Sure, I will finish testing this by early next week.

          Yongjun Zhang I was able to do some basic tests and it works! Thanks for the patch.

The branch-2 version is 2.9.0. However, will this patch work on older versions like 2.2.x?

          stevel@apache.org Steve Loughran added a comment -

Omkar: what version are you looking at? We could talk about a backport to 2.8.1, but given it's a feature, I don't see it being pulled back any earlier.

          omkarksa Omkar Aradhya K S added a comment -

          Steve Loughran I was able to test the bits with HDI 3.3, which is 2.7.1.
However, I was wondering if we can go as far back as 2.5.x/2.2.x?

          stevel@apache.org Steve Loughran added a comment -

          Steve Loughran I was able to test the bits with HDI 3.3, which is 2.7.1.

Good. That's something that can be handled as part of backporting.

          if we can go as back as 2.5.x/2.2.x?

          Sorry, we don't go near that, especially for a new feature. 2.7 is pretty much the limit for backports, unless someone using 2.6 needs something. Twitter do, which is why they lead the 2.6.x releases.

          yzhangal Yongjun Zhang added a comment -

          Thanks much Omkar Aradhya K S and Steve Loughran.

          Sorry I was out for a few days. I just committed to branch-2.

          jrottinghuis Joep Rottinghuis added a comment -

          Thanks. We'll probably pick this up with a 2.9 release. Exciting!

          yzhangal Yongjun Zhang added a comment -

          Updated branch-2 patch HADOOP-11794.010.branch2.002.patch to fix TestDistCpOptions failure.

          HuafengWang Huafeng Wang added a comment -

Hi guys, I noticed that the patch for branch-2 targets version 2.9.x, and we have the same requirement for version 2.8.x. I tried to backport the patch to branch-2.8.2 and found most of the code is compatible. I filed a new issue, and the patch is available there. Please take a look if you are interested.


People

• Assignee: yzhangal Yongjun Zhang
• Reporter: dhruba dhruba borthakur
• Votes: 4
• Watchers: 56