Details
-
Improvement
-
Status: Resolved
-
Minor
-
Resolution: Fixed
-
3.0.0-alpha1
-
None
Description
Distcp was taking long time in copyListing.buildListing() for large source trees (I was using source of 1.5M files in a tree of about 50K directories). For input at s3 buildListing was taking more than one hour. I've noticed a performance bug in the current code which does listStatus twice for each directory which doubles number of RPCs in some cases (if most directories do not contain >1000 files).
Attachments
Attachments
- distcp-liststatus.patch
- 6 kB
- Zoran Dimitrijevic
- distcp-liststatus2.patch
- 4 kB
- Zoran Dimitrijevic
Activity
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12708737/distcp-liststatus.patch
against trunk revision 4922394.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-tools/hadoop-distcp.
Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6041//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6041//console
This message is automatically generated.
I did not change any tests since this is a performance-bug fix and all existing tests pass.
Should I mark this as a bug fix instead of improvement?
Thanks, 3opan. This looks good in general.
Should I mark this as a bug fix instead of improvement?
I don't see this as a bug because the functionality is correct. It seems to be an improvement.
- * Collect the list of + * Collect the list of - * the the source root is a directory, then the source root entry is not + * the the source root is a directory, then the source root entry is not - if (fileStatus.getPath().equals(sourcePathRoot) && + if (fileStatus.getPath().equals(sourcePathRoot) &&
Can you remove these whitespace changes from the patch? It's distracting and it makes it look like things have changed, when in fact they have not. I think there are a few other whitespace changes as well.
traverseDirectory: Maybe we can optimize this even more. Can we pass in the sourceFS to this function, rather than calling Path#getFileSystem? Path#getFileSystem requires some synchronization which might add overheads.
It looks good aside from that. thanks
removed white space diffs.
reused FileSystem sourceFS in traverse as suggested.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12708802/distcp-liststatus2.patch
against trunk revision c94d594.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-tools/hadoop-distcp.
Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6045//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6045//console
This message is automatically generated.
FAILURE: Integrated in Hadoop-trunk-Commit #7508 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7508/)
HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)
- hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
- hadoop-common-project/hadoop-common/CHANGES.txt
FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #153 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/153/)
HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)
- hadoop-common-project/hadoop-common/CHANGES.txt
- hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
FAILURE: Integrated in Hadoop-Yarn-trunk #887 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/887/)
HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)
- hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
- hadoop-common-project/hadoop-common/CHANGES.txt
FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #144 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/144/)
HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)
- hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
- hadoop-common-project/hadoop-common/CHANGES.txt
SUCCESS: Integrated in Hadoop-Hdfs-trunk #2085 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2085/)
HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)
- hadoop-common-project/hadoop-common/CHANGES.txt
- hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #154 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/154/)
HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)
- hadoop-common-project/hadoop-common/CHANGES.txt
- hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
FAILURE: Integrated in Hadoop-Mapreduce-trunk #2103 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2103/)
HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)
- hadoop-common-project/hadoop-common/CHANGES.txt
- hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
Removing fix version, since that should be set at commit time