Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-11785

Reduce number of listStatus operation in distcp buildListing()

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha1
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: tools/distcp
    • Labels:
      None
    • Target Version/s:

      Description

      Distcp was taking long time in copyListing.buildListing() for large source trees (I was using source of 1.5M files in a tree of about 50K directories). For input at s3 buildListing was taking more than one hour. I've noticed a performance bug in the current code which does listStatus twice for each directory which doubles number of RPCs in some cases (if most directories do not contain >1000 files).

      1. distcp-liststatus.patch
        6 kB
        Zoran Dimitrijevic
      2. distcp-liststatus2.patch
        4 kB
        Zoran Dimitrijevic

        Activity

        Hide
        aw Allen Wittenauer added a comment -

        Removing fix version, since that should be set at commit time

        Show
        aw Allen Wittenauer added a comment - Removing fix version, since that should be set at commit time
        Hide
        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12708737/distcp-liststatus.patch
        against trunk revision 4922394.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-tools/hadoop-distcp.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6041//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6041//console

        This message is automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708737/distcp-liststatus.patch against trunk revision 4922394. +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-tools/hadoop-distcp. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6041//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6041//console This message is automatically generated.
        Hide
        3opan Zoran Dimitrijevic added a comment -

        I did not change any tests since this is a performance-bug fix and all existing tests pass.

        Should I mark this as a bug fix instead of improvement?

        Show
        3opan Zoran Dimitrijevic added a comment - I did not change any tests since this is a performance-bug fix and all existing tests pass. Should I mark this as a bug fix instead of improvement?
        Hide
        cmccabe Colin P. McCabe added a comment -

        Thanks, Zoran Dimitrijevic. This looks good in general.

        Should I mark this as a bug fix instead of improvement?

        I don't see this as a bug because the functionality is correct. It seems to be an improvement.

        -   * Collect the list of 
        +   * Collect the list of
        -   *     the the source root is a directory, then the source root entry is not 
        +   *     the the source root is a directory, then the source root entry is not
        -    if (fileStatus.getPath().equals(sourcePathRoot) && 
        +    if (fileStatus.getPath().equals(sourcePathRoot) &&
        

        Can you remove these whitespace changes from the patch? It's distracting and it makes it look like things have changed, when in fact they have not. I think there are a few other whitespace changes as well.

        traverseDirectory: Maybe we can optimize this even more. Can we pass in the sourceFS to this function, rather than calling Path#getFileSystem? Path#getFileSystem requires some synchronization which might add overheads.

        It looks good aside from that. thanks

        Show
        cmccabe Colin P. McCabe added a comment - Thanks, Zoran Dimitrijevic . This looks good in general. Should I mark this as a bug fix instead of improvement? I don't see this as a bug because the functionality is correct. It seems to be an improvement. - * Collect the list of + * Collect the list of - * the the source root is a directory, then the source root entry is not + * the the source root is a directory, then the source root entry is not - if (fileStatus.getPath().equals(sourcePathRoot) && + if (fileStatus.getPath().equals(sourcePathRoot) && Can you remove these whitespace changes from the patch? It's distracting and it makes it look like things have changed, when in fact they have not. I think there are a few other whitespace changes as well. traverseDirectory : Maybe we can optimize this even more. Can we pass in the sourceFS to this function, rather than calling Path#getFileSystem ? Path#getFileSystem requires some synchronization which might add overheads. It looks good aside from that. thanks
        Hide
        3opan Zoran Dimitrijevic added a comment -

        removed white space diffs.
        reused FileSystem sourceFS in traverse as suggested.

        Show
        3opan Zoran Dimitrijevic added a comment - removed white space diffs. reused FileSystem sourceFS in traverse as suggested.
        Hide
        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12708802/distcp-liststatus2.patch
        against trunk revision c94d594.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-tools/hadoop-distcp.

        Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6045//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6045//console

        This message is automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708802/distcp-liststatus2.patch against trunk revision c94d594. +1 @author . The patch does not contain any @author tags. -1 tests included . The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. +1 findbugs . The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. +1 core tests . The patch passed unit tests in hadoop-tools/hadoop-distcp. Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/6045//testReport/ Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/6045//console This message is automatically generated.
        Hide
        cmccabe Colin P. McCabe added a comment -

        +1. Thanks, Zoran Dimitrijevic

        Show
        cmccabe Colin P. McCabe added a comment - +1. Thanks, Zoran Dimitrijevic
        Hide
        cmccabe Colin P. McCabe added a comment -

        committed to 2.8

        Show
        cmccabe Colin P. McCabe added a comment - committed to 2.8
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #7508 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7508/)
        HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

        • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-trunk-Commit #7508 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7508/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java hadoop-common-project/hadoop-common/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #153 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/153/)
        HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #153 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/153/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk #887 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/887/)
        HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

        • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Yarn-trunk #887 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/887/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java hadoop-common-project/hadoop-common/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #144 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/144/)
        HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

        • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
        • hadoop-common-project/hadoop-common/CHANGES.txt
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #144 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/144/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java hadoop-common-project/hadoop-common/CHANGES.txt
        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Hdfs-trunk #2085 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2085/)
        HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Hadoop-Hdfs-trunk #2085 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2085/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #154 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/154/)
        HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #154 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/154/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
        Hide
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #2103 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2103/)
        HADOOP-11785. Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a)

        • hadoop-common-project/hadoop-common/CHANGES.txt
        • hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java
        Show
        hudson Hudson added a comment - FAILURE: Integrated in Hadoop-Mapreduce-trunk #2103 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2103/ ) HADOOP-11785 . Reduce the number of listStatus operation in distcp buildListing (Zoran Dimitrijevic via Colin P. McCabe) (cmccabe: rev 932730df7d62077f7356464ad27f69469965d77a) hadoop-common-project/hadoop-common/CHANGES.txt hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/SimpleCopyListing.java

          People

          • Assignee:
            3opan Zoran Dimitrijevic
            Reporter:
            3opan Zoran Dimitrijevic
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Due:
              Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 1h
              1h
              Remaining:
              Remaining Estimate - 1h
              1h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development