Hadoop Map/Reduce
MAPREDUCE-1423

Improve performance of CombineFileInputFormat when multiple pools are configured

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: client
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      MAPREDUCE-1423. Improve performance of CombineFileInputFormat when multiple pools are configured. (Dhruba Borthakur via zshao)
    • Tags:
      combinefileinputformat

      Description

      I have a map-reduce job that uses CombineFileInputFormat. It is configured with 10000 pools and 30000 files. Creating the splits takes more than an hour. The reason is that CombineFileInputFormat.getSplits() converts the same path from a String to a Path object multiple times, once for each pool. Similarly, it calls Path.toUri() multiple times. This code can be optimized.
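
      The cost pattern described above can be illustrated with a minimal sketch. This is not the attached patch: it uses java.net.URI as a stand-in for Hadoop's Path, and the Pool class, the prefix-matching rule, and the method names are hypothetical. It only shows the difference between parsing every path string once per pool and parsing each path once and reusing the result.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PathConversionSketch {

    /** Hypothetical stand-in for a pool: accepts paths under a given prefix. */
    static final class Pool {
        private final String prefix;
        Pool(String prefix) { this.prefix = prefix; }
        boolean accepts(URI uri) { return uri.getPath().startsWith(prefix); }
    }

    /** Number of String -> URI conversions performed by this instance. */
    private int conversions = 0;

    private URI toUri(String path) {
        conversions++;
        return URI.create(path);
    }

    int conversions() { return conversions; }

    /** Naive: re-parses every path string once per pool (pools x paths conversions). */
    int matchNaive(List<String> paths, List<Pool> pools) {
        int matches = 0;
        for (Pool pool : pools) {
            for (String p : paths) {
                if (pool.accepts(toUri(p))) matches++;
            }
        }
        return matches;
    }

    /** Hoisted: parses each path once, then reuses the parsed form for every pool. */
    int matchCached(List<String> paths, List<Pool> pools) {
        List<URI> uris = new ArrayList<>(paths.size());
        for (String p : paths) uris.add(toUri(p));
        int matches = 0;
        for (Pool pool : pools) {
            for (URI u : uris) {
                if (pool.accepts(u)) matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/data/a/f1", "/data/a/f2", "/data/b/f3");
        List<Pool> pools = Arrays.asList(new Pool("/data/a"), new Pool("/data/b"));

        PathConversionSketch naive = new PathConversionSketch();
        PathConversionSketch cached = new PathConversionSketch();
        System.out.println(naive.matchNaive(paths, pools) + " matches, "
            + naive.conversions() + " conversions");
        System.out.println(cached.matchCached(paths, pools) + " matches, "
            + cached.conversions() + " conversions");
    }
}
```

      In this toy setup both variants find the same matches, but the naive loop performs pools x paths conversions while the hoisted one performs one per path. The counter is kept as instance state rather than a static field so that separate instances do not share it.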

      1. CombineFileInputFormatPerformance.txt
        8 kB
        dhruba borthakur
      2. CombineFileInputFormatPerformance.txt
        6 kB
        dhruba borthakur

        Activity

        dhruba borthakur added a comment -

        With this patch, each string is converted to a Path only once. When multiple pools are configured, this improves split creation substantially: a job that needed 6 hours to create splits now takes about 1.5 hours.

        dhruba borthakur added a comment -

        Merged the patch to the latest trunk. Removed a static variable so that this class is thread-safe. This is ready for review.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12435766/CombineFileInputFormatPerformance.txt
        against trunk revision 909340.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/450/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/450/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/450/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/450/console

        This message is automatically generated.

        dhruba borthakur added a comment -

        trigger Hudson QA tests

        dhruba borthakur added a comment -

        trigger hudson QA tests

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12435766/CombineFileInputFormatPerformance.txt
        against trunk revision 912471.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/470/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/470/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/470/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/470/console

        This message is automatically generated.

        dhruba borthakur added a comment -

        The unit test failure is not related to this patch at all.

        dhruba borthakur added a comment -

        I verified that the failed test TestMiniMRLocalFS.testWithLocal fails even without this patch.

        Zheng Shao added a comment -

        Committed. Thanks Dhruba!

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #256 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/256/)
        MAPREDUCE-1423. Improve performance of CombineFileInputFormat when multiple pools are configured. (Dhruba Borthakur via zshao)

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #248 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/248/)


          People

          • Assignee:
            dhruba borthakur
            Reporter:
            dhruba borthakur
          • Votes:
            0
            Watchers:
            3
