Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-11785

Reduce number of listStatus operation in distcp buildListing()

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • 3.0.0-alpha1
    • 2.8.0, 3.0.0-alpha1
    • tools/distcp
    • None

    Description

      Distcp was taking long time in copyListing.buildListing() for large source trees (I was using source of 1.5M files in a tree of about 50K directories). For input at s3 buildListing was taking more than one hour. I've noticed a performance bug in the current code which does listStatus twice for each directory which doubles number of RPCs in some cases (if most directories do not contain >1000 files).

      Attachments

        1. distcp-liststatus.patch
          6 kB
          Zoran Dimitrijevic
        2. distcp-liststatus2.patch
          4 kB
          Zoran Dimitrijevic

        Activity

          Removing fix version, since that should be set at commit time

          aw Allen Wittenauer added a comment - Removing fix version, since that should be set at commit time
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12708737/distcp-liststatus.patch
          against trunk revision 4922394.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-tools/hadoop-distcp.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6041//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6041//console

          This message is automatically generated.

          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708737/distcp-liststatus.patch against trunk revision 4922394. +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-tools/hadoop-distcp. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6041//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6041//console This message is automatically generated.

          I did not change any tests since this is a performance-bug fix and all existing tests pass.

          Should I mark this as a bug fix instead of improvement?

          3opan Zoran Dimitrijevic added a comment - I did not change any tests since this is a performance-bug fix and all existing tests pass. Should I mark this as a bug fix instead of improvement?
          cmccabe Colin McCabe added a comment -

          Thanks, 3opan. This looks good in general.

          Should I mark this as a bug fix instead of improvement?

          I don't see this as a bug because the functionality is correct. It seems to be an improvement.

          -   * Collect the list of 
          +   * Collect the list of
          -   *     the the source root is a directory, then the source root entry is not 
          +   *     the the source root is a directory, then the source root entry is not
          -    if (fileStatus.getPath().equals(sourcePathRoot) && 
          +    if (fileStatus.getPath().equals(sourcePathRoot) &&
          

          Can you remove these whitespace changes from the patch? It's distracting and it makes it look like things have changed, when in fact they have not. I think there are a few other whitespace changes as well.

          traverseDirectory: Maybe we can optimize this even more. Can we pass in the sourceFS to this function, rather than calling Path#getFileSystem? Path#getFileSystem requires some synchronization which might add overheads.

          It looks good aside from that. thanks

          cmccabe Colin McCabe added a comment - Thanks, 3opan . This looks good in general. Should I mark this as a bug fix instead of improvement? I don't see this as a bug because the functionality is correct. It seems to be an improvement. - * Collect the list of + * Collect the list of - * the the source root is a directory, then the source root entry is not + * the the source root is a directory, then the source root entry is not - if (fileStatus.getPath().equals(sourcePathRoot) && + if (fileStatus.getPath().equals(sourcePathRoot) && Can you remove these whitespace changes from the patch? It's distracting and it makes it look like things have changed, when in fact they have not. I think there are a few other whitespace changes as well. traverseDirectory : Maybe we can optimize this even more. Can we pass in the sourceFS to this function, rather than calling Path#getFileSystem ? Path#getFileSystem requires some synchronization which might add overheads. It looks good aside from that. thanks

          removed white space diffs.
          reused FileSystem sourceFS in traverse as suggested.

          3opan Zoran Dimitrijevic added a comment - removed white space diffs. reused FileSystem sourceFS in traverse as suggested.
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12708802/distcp-liststatus2.patch
          against trunk revision c94d594.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-tools/hadoop-distcp.

          Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6045//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6045//console

          This message is automatically generated.

          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708802/distcp-liststatus2.patch against trunk revision c94d594. +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-tools/hadoop-distcp. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6045//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6045//console This message is automatically generated.
          cmccabe Colin McCabe added a comment -

          +1. Thanks, 3opan

          cmccabe Colin McCabe added a comment - +1. Thanks, 3opan
          cmccabe Colin McCabe added a comment -

          committed to 2.8

          cmccabe Colin McCabe added a comment - committed to 2.8
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #7508 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7508/)
          HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

          • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #7508 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7508/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java hadoop-common-project/hadoop-common/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #153 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/153/)
          HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #153 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/153/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #887 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/887/)
          HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

          • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk #887 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/887/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java hadoop-common-project/hadoop-common/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #144 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/144/)
          HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

          • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
          • hadoop-common-project/hadoop-common/CHANGES.txt
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #144 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/144/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java hadoop-common-project/hadoop-common/CHANGES.txt
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Hdfs-trunk #2085 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2085/)
          HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
          hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Hdfs-trunk #2085 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2085/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #154 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/154/)
          HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #154 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/154/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2103 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2103/)
          HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

          • hadoop-common-project/hadoop-common/CHANGES.txt
          • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
          hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk #2103 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2103/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java

          People

            3opan Zoran Dimitrijevic
            3opan Zoran Dimitrijevic
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 1h
                1h
                Remaining:
                Remaining Estimate - 1h
                1h
                Logged:
                Time Spent - Not Specified
                Not Specified