Hadoop Map/Reduce: MAPREDUCE-4808

Refactor MapOutput and MergeManager to facilitate reuse by Shuffle implementations

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.3-alpha
    • Component/s: None
    • Labels: None
    • Hadoop Flags:
      Reviewed

      Description

      Now that Shuffle is pluggable (MAPREDUCE-4049), it would be convenient for alternate implementations to be able to reuse portions of the default implementation.

      This would come with the strong caveat that these classes are LimitedPrivate and Unstable.

      1. M4808-1.patch
        103 kB
        Chris Douglas
      2. M4808-0.patch
        103 kB
        Chris Douglas
      3. mapreduce-4808.patch
        45 kB
        Mariappan Asokan
      4. MR-4808.patch
        51 kB
        Alejandro Abdelnur
      5. mapreduce-4808.patch
        62 kB
        Mariappan Asokan
      6. mapreduce-4808.patch
        61 kB
        Mariappan Asokan
      7. mapreduce-4808.patch
        78 kB
        Mariappan Asokan
      8. mapreduce-4808.patch
        71 kB
        Mariappan Asokan
      9. MergeManagerPlugin.pdf
        47 kB
        Mariappan Asokan
      10. mapreduce-4808.patch
        67 kB
        Mariappan Asokan
      11. mapreduce-4808.patch
        67 kB
        Mariappan Asokan
      12. COMBO-mapreduce-4809-4812-4808.patch
        84 kB
        Mariappan Asokan
      13. mapreduce-4808.patch
        39 kB
        Mariappan Asokan

        Activity

        Mariappan Asokan added a comment -

        Hi Arun and Alejandro,
        Please take a look at the bare-bones version of the patch in MAPREDUCE-4812. I will come back to this one once that is committed. Arun, as per your suggestion, the ReduceInputMerger plugin will be passed to Shuffle. However, to implement the sort-avoidance plugin I need MAPREDUCE-4049 (shuffle plugin) to be committed.

        Thanks.
        – Asokan

        Mariappan Asokan added a comment -

        Hi Arun and Alejandro,
        I have uploaded the incremental patch as well as the combo patch. The combo patch still has a mock plugin test. I will attach the original end-to-end test once MAPREDUCE-4809, 4807, 4812, 4049, and 4048 are committed, in that order.

        Thanks.
        – Asokan

        Mariappan Asokan added a comment -

        I meant 4808 not 4048 in the above comment.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12554898/COMBO-mapreduce-4809-4812-4808.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3070//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3070//console

        This message is automatically generated.

        Mariappan Asokan added a comment -

        Hi Arun and Alejandro,
        I picked up Avner's patch for MAPREDUCE-4049, made some minor changes, and merged it with MAPREDUCE-4809, MAPREDUCE-4807, MAPREDUCE-4812, and MAPREDUCE-4808 to build an experimental version. I modified the original end-to-end test (which does a copy operation on both the map and reduce sides) and it ran successfully with this experimental version.

        – Asokan

        Arun C Murthy added a comment -

        Asokan, apologies for the delay.

        I'm very confused about this patch.

        Can you please explain why you need ReduceInputMerger and why it has only some of the interfaces in MergeManager?

        How do you intend to use this for SyncSort?

        Thanks.

        Arun C Murthy added a comment -

        To be clear: my preference would be to just use MergeManager as the interface.

        IAC, I feel I just don't understand the requirements - can you pls elaborate?

        Arun C Murthy added a comment -

        On second thoughts, a simpler solution: why don't we use a simplified version of Merger as the interface?

        This way Syncsort can just implement that, why bother with trying to deal with memory reservation etc.?

        Alejandro Abdelnur added a comment -

        Arun, what do you exactly mean?

        Both Merger and MergeManager are classes.

        The reduce-side logic in the Merger class is used in only 2 places: within the MergeManager (used by the Shuffle) and in the ReduceTask (for the local case).

        This patch moves the latter use into the MergeManager as well. By doing this, all reduce merge logic, for both the local and the distributed case, is encapsulated in the MergeManager.

        Then, when an alternate implementation is provided, it can handle both cases, local and distributed.

        The resulting interface being introduced as ReduceInputMerger is quite simple:

          public void init(Context<K, V> reduceMergerContext);
          public void waitForResource() throws InterruptedException;
          public MapOutput<K, V> reserve(TaskAttemptID mapId, long requestedSize,
                                         int fetcher) throws IOException;
          public RawKeyValueIterator close() throws Throwable;
        
          // To merge files created for a local job.
          public RawKeyValueIterator mergeLocalFiles(Path localFiles[])
            throws IOException;
        

        I think this is much simpler than trying to modify things in the Merger, given that the Merger is not used directly by the Shuffle, but through the MergeManager.
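        A Shuffle implementation would drive that lifecycle as: init once, then waitForResource/reserve per fetched map output, then close for the final merge. A toy sketch of that call order (the interface and driver below are stand-ins; the real signatures use TaskAttemptID, MapOutput<K, V>, RawKeyValueIterator, and Path):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the proposed ReduceInputMerger; types simplified to String.
interface ReduceInputMergerSketch {
  void init(String context);
  void waitForResource() throws InterruptedException;
  String reserve(String mapId, long requestedSize, int fetcher);
  String close();
}

// Records the order in which the Shuffle driver invokes the merger.
class RecordingMerger implements ReduceInputMergerSketch {
  final List<String> calls = new ArrayList<>();
  public void init(String context) { calls.add("init"); }
  public void waitForResource() { calls.add("wait"); }
  public String reserve(String mapId, long size, int fetcher) {
    calls.add("reserve:" + mapId);
    return mapId; // stand-in for a MapOutput reservation
  }
  public String close() { calls.add("close"); return "iterator"; }
}

public class ShuffleDriverSketch {
  static void run(ReduceInputMergerSketch merger, String... mapIds)
      throws InterruptedException {
    merger.init("ctx");             // once, before fetchers start
    for (String id : mapIds) {
      merger.waitForResource();     // block until memory/disk is available
      merger.reserve(id, 1024, 0);  // reserve space; the fetcher then fills it
    }
    merger.close();                 // final merge; yields the reduce input
  }
  public static void main(String[] args) throws InterruptedException {
    RecordingMerger m = new RecordingMerger();
    run(m, "m_0", "m_1");
    System.out.println(m.calls);
  }
}
```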

        Mariappan Asokan added a comment -

        Hi Arun,
        Thanks for your feedback. Perhaps I should mention some use cases of a MergeManager plugin in addition to the technical details of the design mentioned here as well as in MAPREDUCE-4812.

        A MergeManager plugin would allow us, and any implementer of the plugin, to do a variety of additional transformations, such as copy, limit-N query (MAPREDUCE-1928), full join, and hashed aggregation, more efficiently. Since the shuffle code is available in the framework, we want to make use of it. In my opinion, the framework shuffle code seems to be stable in MRv2.

        Making Merger pluggable will not add much value. If I understand correctly, it allows plugin implementers to implement only a single pass of the merge; the overall merge is still driven by MergeManager. Also, only a merge operation is possible, so any additional transformation has to be done in the Reducer, which is often not very efficient.

        I hope I have clarified the usefulness of making MergeManager pluggable. Please feel free to ask if you have any questions.

        Thanks.

        – Asokan

        Arun C Murthy added a comment -

        Asokan, thanks for the clarification.

        However, I'm still trying to understand what you are trying to achieve here.

        The original goal of the parent task (MAPREDUCE-2454) was to make 'sort pluggable'.

        We've accomplished that with MAPREDUCE-4807 and MAPREDUCE-4809.

        Now, are we done? If not, what else is remaining to achieve that? Do you need some special hook in the Reducer's merge for Syncsort?

        As I've told you in person, when making sweeping changes to the framework it's better to focus on the 'goal' and make the minimal changes needed to get there.

        We can always do more work and add more features, but let's do one thing at a time. We can add limit-N etc. separately; doing it here just delays this jira - why do that?

        Arun C Murthy added a comment -

        To be clear, I'm not against newer features - I just want them done independently so we can close this out and be done with it.

        Arun C Murthy added a comment -

        Asokan? Alejandro?

        Can you pls confirm if we need this for MAPREDUCE-2454? If not, we can decouple this and do it separately after merging MAPREDUCE-2454 to trunk.

        Thoughts?

        Chris Douglas added a comment -

        Can you pls confirm if we need this for MAPREDUCE-2454? If not, we can decouple this and do it separately after merging MAPREDUCE-2454 to trunk.

        From his comment on MAPREDUCE-4049, I think that's the consensus approach.

        Alejandro Abdelnur added a comment -

        The idea is to merge what we have in branch MR-2454 (MAPREDUCE-4809 & MAPREDUCE-4807) in trunk and continue the remaining work in trunk.

        Arun C Murthy added a comment -

        I'll ask again - is there any remaining work required on reduce-side for Syncsort or are we done?

        For e.g. MAPREDUCE-4809 made changes to reduce-side (MapHost etc.) - are they now deemed unnecessary?

        Alejandro Abdelnur added a comment -

        Yes, there is work pending on the reducer side: making the MergeManager pluggable (MAPREDUCE-4812), refactoring MapOutput into an abstract class with 2 concrete impls (MEM & DISK), and enabling the local case to use the pluggable MergeManager (this JIRA). The changes in MAPREDUCE-4809 were of visibility only.

        Mariappan Asokan added a comment -

        I will briefly outline what the patch does and the rationale.

        • It makes the MergeManager pluggable. Rationale: MergeManager does the merge sorting on the reduce side and is part of the overall sorting that happens in MR data flow.
        • It makes MapOutput class overridable.
          Rationale: MergeManager plugin implementations can make efficient use of JVM memory for data shuffling and provide their own implementation of MapOutput.
        • It makes local job runs use MergeManager or plugin implementations to do the merge sort instead of Merger.
          Rationale: Local job runs should also be able to use the sort plugin on the reduce side just like they can make use of the sort plugin on the map side.
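        The pluggability in the first bullet follows the usual class-name-in-configuration pattern. A minimal sketch of that pattern, using java.util.Properties in place of Hadoop's Configuration; the property key and the classes below are hypothetical stand-ins, not the names the actual patch defines:

```java
import java.util.Properties;

// Stand-in for the plugin contract a MergeManager implementation would satisfy.
interface MergeManagerLike { String name(); }

class DefaultMergeManager implements MergeManagerLike {
  public String name() { return "default"; }
}

class CustomMergeManager implements MergeManagerLike {
  public String name() { return "custom"; }
}

public class PluginLoadSketch {
  // Hypothetical key; the real property name is defined by the committed patch.
  static final String KEY = "mapreduce.reduce.merge.manager.class";

  // Resolve the configured class, falling back to the framework default.
  static MergeManagerLike load(Properties conf) throws Exception {
    String cls = conf.getProperty(KEY, DefaultMergeManager.class.getName());
    return (MergeManagerLike) Class.forName(cls)
        .getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws Exception {
    Properties conf = new Properties();
    System.out.println(load(conf).name());                      // default impl
    conf.setProperty(KEY, CustomMergeManager.class.getName());
    System.out.println(load(conf).name());                      // plugin impl
  }
}
```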
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12561370/mapreduce-4808.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified test files.

        -1 javac. The applied patch generated 2018 javac compiler warnings (more than the trunk's current 2012 warnings).

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3129//testReport/
        Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3129//artifact/trunk/patchprocess/diffJavacWarnings.txt
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3129//console

        This message is automatically generated.

        Mariappan Asokan added a comment -

        Fixed java compiler warnings in the tests.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12561532/mapreduce-4808.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3131//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3131//console

        This message is automatically generated.

        Tom White added a comment -

        These changes look reasonable to me. The change in MapOutput means the Fetcher doesn't know whether the shuffle is in-memory or on-disk - it's hidden behind MapOutput's shuffle() method - which is an improved separation of concerns. MergeManagerPlugin is then free to return whatever implementations of MapOutput it likes.

        • It wasn't immediately obvious to me that returning null in MergeManager#reserve() means "wait" (since the "wait" type of MapOutput has gone). It would be good to have a comment to that effect.
        • It would be good to have javadoc for the methods on MergeManagerPlugin.
        • In TestMergeManagerPlugin the try/catch blocks can be avoided by making the tests throw the relevant exception.
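        The separation of concerns described above can be sketched with stand-in classes (the real ones live in org.apache.hadoop.mapreduce.task.reduce; everything below is illustrative): the fetcher hands the input stream to shuffle() and never learns whether the bytes land in memory or on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Stand-in for the abstract MapOutput: the subclass chooses the destination.
abstract class MapOutput {
  private final String mapId;
  MapOutput(String mapId) { this.mapId = mapId; }
  String getMapId() { return mapId; }
  abstract void shuffle(InputStream input, long size) throws IOException;
  abstract String describe();
}

class InMemoryMapOutput extends MapOutput {
  private byte[] data;
  InMemoryMapOutput(String mapId) { super(mapId); }
  @Override void shuffle(InputStream input, long size) throws IOException {
    data = new byte[(int) size];
    int off = 0;
    while (off < size) {                       // read fully into the buffer
      int n = input.read(data, off, (int) size - off);
      if (n < 0) throw new IOException("premature EOF");
      off += n;
    }
  }
  @Override String describe() { return "memory:" + data.length + " bytes"; }
}

class OnDiskMapOutput extends MapOutput {
  private long written;
  OnDiskMapOutput(String mapId) { super(mapId); }
  @Override void shuffle(InputStream input, long size) throws IOException {
    // A real implementation streams to a local file; here we just count bytes.
    byte[] buf = new byte[4096];
    int n;
    while ((n = input.read(buf)) > 0) written += n;
  }
  @Override String describe() { return "disk:" + written + " bytes"; }
}

public class FetcherSketch {
  // The fetcher code path is identical for both destinations.
  static String fetch(MapOutput out, byte[] payload) throws IOException {
    out.shuffle(new ByteArrayInputStream(payload), payload.length);
    return out.describe();
  }
  public static void main(String[] args) throws IOException {
    byte[] payload = "segment".getBytes();
    System.out.println(fetch(new InMemoryMapOutput("m_0"), payload));
    System.out.println(fetch(new OnDiskMapOutput("m_1"), payload));
  }
}
```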
        Arun C Murthy added a comment -

        Mariappan, can you pls put up a design doc on how you see all these apis being used?

        Rather than debate code upfront, let's get on the same page about our goals and use-cases. Tx.

        Mariappan Asokan added a comment -

        Hi Tom,
        Thanks for your comments. I have made the changes you suggested and uploaded a new patch.

        Hi Arun,
        I created a design document that contains both the use cases and the rationale for the new interface for MergeManagerPlugin.

        Thanks to both of you for reviewing the patch.

        – Asokan

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12562030/MergeManagerPlugin.pdf
        against trunk revision .

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3155//console

        This message is automatically generated.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12562034/mapreduce-4808.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3156//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3156//console

        This message is automatically generated.

        Mariappan Asokan added a comment -

        Picked up changes I made to MergeManager.java in MAPREDUCE-4842 to fix a severe bug and uploaded a new patch.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12562292/mapreduce-4808.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 4 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3171//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3171//console

        This message is automatically generated.

        Alejandro Abdelnur added a comment -

        +1, to the patch and design doc.

I'll add that having this pluggability in place would have allowed a fix for MAPREDUCE-4842 to be available to jobs hitting the problem until the fix is delivered in a Hadoop release.

I'll wait until after the New Year to give others the opportunity to review/comment on the patch and design doc.

        Tom White added a comment -

        Asokan, thanks for addressing my feedback. The latest patch looks good to me.

        Arun C Murthy added a comment -

        Asokan, sorry I've been away traveling home during the holidays and hence the delay.

        I have more comments, but I'll put some here to keep the discussion going.

        Thanks for the design doc, but I was looking for thoughts on how the plugin would be used for the use-cases you've mentioned (hash-join etc.), design alternatives, and so on.

        IAC, taking a step back, the 'goal' here is to make the 'merge' pluggable.

        Reduce-side has 2 pieces:

        1. Shuffle - Move data from maps to the reduce.
        2. Merge - Merge already sorted map-outputs.

        The rest (MergeManager etc.) are merely implementation details to manage memory etc., which are irrelevant in several scenarios as soon as we consider alternatives to the current HTTP-based shuffle (several alternatives exist, such as RDMA).

        Your current approach tries to encapsulate and enshrine the current implementation of the reduce task, which I'm not wild about. By this I mean, you are focussing too much on the current state and trying to make interfaces which are unnecessary for now and might not suffice for the future.

        I really don't think we should be tying Shuffle & Merge as you have done by introducing yet another new interface (regardless of whether it's public or not).

        As I've noted above, adding a simple 'Merge' interface with one 'merge' call will address all of the use-cases you have outlined. If not, let's discuss.

        Arun C Murthy added a comment -

        As I've noted above, adding a simple 'Merge' interface with one 'merge' call will address all of the use-cases you have outlined. If not, let's discuss.

        Forgot to add - if you don't think this suffices, please help me understand why. We can discuss alternatives then. Thanks.
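For concreteness, here is a minimal sketch of what such a single-call Merge interface could look like. All names here (Segment, Merger, ConcatMerger) are hypothetical illustrations, not actual Hadoop classes:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// A fetched, sorted run of map output; hypothetical stand-in type.
interface Segment<T> extends Iterable<T> {}

// The single-method interface suggested above: one merge() call that
// turns all segments into the iterator the reduce consumes.
interface Merger<T> {
    Iterator<T> merge(List<Segment<T>> segments);
}

// A trivial implementation that just concatenates segments, standing in
// for any strategy (hash-join, limit-N) that needs no total order.
class ConcatMerger<T> implements Merger<T> {
    public Iterator<T> merge(List<Segment<T>> segments) {
        List<T> out = new ArrayList<>();
        for (Segment<T> s : segments) {
            for (T t : s) {
                out.add(t);
            }
        }
        return out.iterator();
    }
}
```

A limit-N or hash-join plugin would supply its own Merger implementation in place of ConcatMerger.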

        Arun C Murthy added a comment -

        As an example - with RDMA-based shuffle we don't need MergeManager at all since we can just do very-wide, network merges directly.

        Mariappan Asokan added a comment -

        Hi Arun,
        Thanks for your comments. The design for the use cases like hash-join, limit-N query, and so on is left to the creativity of the implementer of the plugin. I did not want to mention any specific designs. For the case of limit-N query, I created a test which contains one implementation of the plugin.

        You mentioned RDMA shuffle as one of the alternative shuffle implementations. RDMA shuffle requires special hardware (an InfiniBand card) which may not be present in all Hadoop installations. RDMA-based shuffle does not require a MergeManager because it is a combination of shuffle and merge implemented in native code; there is no clear separation of shuffle and merge.

        The current HTTP shuffle has been around in Hadoop for a long time, and functionally it will continue to work even with InfiniBand cards by using IP over InfiniBand (IPoIB) without requiring any native code.

        I consider RDMA shuffle a special case, and it is not going to be common enough to warrant obsoleting the current separation of shuffle and merge. Besides, a merge plugin does not preclude RDMA merge; a shuffle plugin can be used for that purpose.

        The MergeManager not only manages memory; it also coordinates with the shuffle and manages multiple merge passes. The interface I have defined captures the methods needed for these purposes. A single merge() method will not suffice to take care of merging shuffled data.

        – Asokan

        Jerry Chen added a comment -

        Hi Arun and Asokan,
        I am trying an implementation of hash-based reduce instead of the global sort and merge. First of all, Asokan's work is valuable for making this implementation possible. I have tried to use the interfaces in this patch to plug in a HashMergeManager. Mostly, the current interface can satisfy the hash merge manager's needs, but I am not sure whether the interface will "suffice for the future" for others, just as Arun points out.

        The problem I encountered during the implementation is reusing the common code of the MergeManager. HashMergeManager shares a lot of features and code with the MergeManager, such as OnDiskMapOutput, InMemoryMapOutput, InMemoryReader, and MergeThread (PartitionThread actually). But the current code base of these classes makes references to MergeManager and thus cannot be reused by a HashMergeManager. I have to make another copy of these classes and modify them to refer to HashMergeManager. This generates duplicated code, which is not preferred.

        It would be best to either make MergeManager inheritable (by making some private members protected) or abstract out OnDiskMapOutput, InMemoryMapOutput, and others into utility classes that can be shared by "light weight" merge managers which have a lot in common with MergeManager and only modify some small aspects.

        Jerry
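One way to realize the decoupling described above, sketched with purely hypothetical names (ShuffleConsumer, MapOutputSketch, and the subclasses), would be to have the MapOutput subclasses report to a narrow callback interface rather than to the concrete MergeManager, so any lightweight merge manager can reuse them:

```java
// Narrow callback that MapOutput subclasses report to; any merge manager
// (MergeManager, HashMergeManager, ...) can implement it.
interface ShuffleConsumer {
    void committed(String mapId);   // output fully fetched
    void aborted(String mapId);     // fetch failed; release resources
}

// Base class referencing only the interface, not a concrete manager.
abstract class MapOutputSketch {
    private final String mapId;
    private final ShuffleConsumer consumer;

    protected MapOutputSketch(String mapId, ShuffleConsumer consumer) {
        this.mapId = mapId;
        this.consumer = consumer;
    }

    void commit() { consumer.committed(mapId); }
    void abort()  { consumer.aborted(mapId); }
}

// Reusable concrete flavors, analogous to OnDiskMapOutput / InMemoryMapOutput.
class OnDiskOutputSketch extends MapOutputSketch {
    OnDiskOutputSketch(String mapId, ShuffleConsumer c) { super(mapId, c); }
}

class InMemoryOutputSketch extends MapOutputSketch {
    InMemoryOutputSketch(String mapId, ShuffleConsumer c) { super(mapId, c); }
}
```

With this shape, a HashMergeManager would only implement the callback interface instead of copying the output classes.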

        Mariappan Asokan added a comment -

        Hi Jerry,
        Thanks for your feedback and comments. There may be several ways to implement a MergeManager plugin. Inheriting from the current MergeManager, as you tried in your implementation, is one of them.

        I would suggest you take a look at the test that I created for the patch. It contains a plugin implementation that does no sorting on either the map or the reduce side. Although it is a very simple implementation, it is a good starting point for your hash aggregation. I attached it to some of the previous patches I posted for MAPREDUCE-2454. If you want the latest copy, please shoot an e-mail to me and I will send it to you.

        – Asokan

        Alejandro Abdelnur added a comment -

        Jerry,

        Would your suggested changes be on the MergeManager implementation only? Without affecting the interfaces this JIRA defines? If that is the case, I'd suggest having a different JIRA to modify the visibility of the methods/classes from MergeManager; by doing that we keep this JIRA focused on making MergeManager pluggable and the follow up JIRA to enable reusing it.

        Regarding your comments on whether this interface will 'suffice for the future': that is why all these classes/interfaces are considered UNSTABLE and LIMITED_PRIVATE. In other words, this is an initial step that enables the current needs, but we could change it in the future.

        Chris Douglas added a comment -

        I like many of these changes anyway - the refactoring from tagged unions to a type hierarchy for MapOutput is nice - though I'm still not grokking why the new API is necessary. The design doc reads like a summary of the patches and a motivation for pluggability generally, without directly motivating this particular moving part. Mariappan Asokan, is there code from MAPREDUCE-2454 reviewers should look at to better understand why this is required?

        While much of the shuffle code can be reused between implementations- as Jerry points out- a common base for all implementations would be overly-ambitious. To rephrase what Arun has been saying: adding a "plugin" for tweaking the existing code is not necessary when the whole Shuffle can be swapped out after MAPREDUCE-4049.

        The only exception I can think of is for LocalRunner, where the Merger- not the MergeManager- needs to be pluggable. The current patch introduces a dependency on the MergeManager for the local case, using only the subset that applies in that context. Without a use case (e.g., useful for testing), it's hard to justify the patch's abstraction if all implementations of MergeManager are exposed to this special invocation. To be fair, it may be required for the "ubertask" code to work seamlessly, but that case needs to be made. Even so: the Merger seems like the right cut point.

        To be concrete, take the RDMA/network merge case as an example. Assume the overhead of a reader requires one to limit the number of clients/open connections in the reduce. Implementing a Merger that keeps a total order on its segments and only keeps the top k remote segments open would loosely track that resource. But since the Fetcher is essentially just packaging the remote segment and translating the TCE into a MapOutput (i.e., doesn't open a connection), it's not obvious why a MergeManager would coordinate resource use between them. Even with a hybrid scheme- where a set of Fetcher and Merger instances pull segments locally based on some cost function, so they do share resources- aren't the changes extensive enough to justify a separate Shuffle implementation?

        The Merger does handle multiple merge passes; intermediate merges of segments are part of its contract. While you may be referring to continuous merging of segments arriving concurrently with a merge, again: why would it be preferable to configure a MergeManager instead of a wholly new Shuffle impl?

        Alejandro Abdelnur added a comment -

        Arun, Chris, Asokan,

        It seems the contention point here is adding the MergeManager as a public API.

        How about the following alternative?

        1. Refactor MapOutput into an abstract class and introduce Memory and Disk subclasses.
        2. Make the MergeManager pluggable in the Shuffle class via ReflectionUtils or a protected createMergeManager() method.
        3. Move the thread creation from the MergeManager constructor to an init() method.

        This would keep the MergeManager specific to the Shuffle implementation but would allow alternate implementations to reuse parts of it.

        The only thing left would be how to handle the local case, which is currently done by the Merger via a static method.
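Points 2 and 3 can be sketched roughly as follows. The class names and signatures here are simplified stand-ins, not the real Shuffle/MergeManager API:

```java
// Stand-in for MergeManager: thread startup moved out of the
// constructor into init(), per point 3.
class MergeManagerSketch {
    private boolean threadsStarted = false;

    void init() {
        threadsStarted = true;  // merge threads would be started here
    }

    boolean isStarted() { return threadsStarted; }
}

// Stand-in for Shuffle: a protected factory hook, per point 2, so a
// subclass (or ReflectionUtils on a configured class name) can swap
// in a different merge manager implementation.
class ShuffleSketch {
    protected MergeManagerSketch createMergeManager() {
        return new MergeManagerSketch();
    }

    MergeManagerSketch init() {
        MergeManagerSketch mm = createMergeManager();
        mm.init();  // safe: no threads were spun up in the constructor
        return mm;
    }
}
```

A subclass overriding createMergeManager() reuses the whole Shuffle while substituting its own merge strategy.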

        Chris Douglas added a comment -

        Alejandro Abdelnur: that sounds reasonable; it should also simplify the API by eliminating the mergeLocalFiles method.

        Mariappan Asokan added a comment -

        Hi Chris and Alejandro,
        Thanks to both of you for your comments and suggestions. I spent some time reflecting on them. I totally agree with Chris that the RDMA/network merge will be different from the way the merge is done by the current MergeManager. In that case a shuffle plugin is the way to go.

        When the current HTTP based Shuffle is used, it makes sense to define a MergeManagerPlugin to work with that. So based on Alejandro's suggestion, I updated the patch to make MergeManagerPlugin work only with Shuffle. I removed the method mergeLocalFiles() from the interface as Chris suggested. The Shuffle class would instantiate the MergeManagerPlugin object in its init() method.

        At the end of the day, Hadoop users will be offered more choices: a shuffle plugin for RDMA/network merge when the cluster nodes have hardware support, and a merge plugin that works with the current shuffle when the cluster does not have special network hardware.

        Please provide any feedback on the updated patch.

        Once again, thanks to both of you for the time you are spending on this Jira.

        – Asokan

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12564997/mapreduce-4808.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3242//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3242//console

        This message is automatically generated.

        Alejandro Abdelnur added a comment -

        Thanks Asokan, LGTM. Chris, is this what you had in mind? Are we good to go? Thx

        Mariappan Asokan added a comment -

        Added the call to the init() method of MergeManagerPlugin in Shuffle's init(). Surprised that the testing did not catch this.

        – Asokan

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12565016/mapreduce-4808.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3244//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3244//console

        This message is automatically generated.

        Arun C Murthy added a comment -

        This is better - but it still makes the MergeManager a public API, which I'm not wild about.

        Why aren't we considering the Merger? It would be simple and would easily solve the limit-N case by aborting the merge after N+1 items are seen.
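Sketched concretely (hypothetical code, not the actual Hadoop Merger API), a limit-N merge is just a k-way merge that stops emitting once the limit is reached:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class LimitNMerger {
    // Merge sorted segments but emit at most 'limit' smallest records.
    static List<Integer> merge(List<List<Integer>> segments, int limit) {
        // classic k-way merge: heap entries are (value, segmentIdx, offset)
        PriorityQueue<int[]> pq =
            new PriorityQueue<>((x, y) -> Integer.compare(x[0], y[0]));
        for (int i = 0; i < segments.size(); i++) {
            if (!segments.get(i).isEmpty()) {
                pq.add(new int[]{segments.get(i).get(0), i, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!pq.isEmpty() && out.size() < limit) {  // abort at the limit
            int[] top = pq.poll();
            out.add(top[0]);
            int seg = top[1];
            int next = top[2] + 1;
            if (next < segments.get(seg).size()) {
                pq.add(new int[]{segments.get(seg).get(next), seg, next});
            }
        }
        return out;
    }
}
```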

        Arun C Murthy added a comment -

        Would you prefer to have me implement a prototype of the Merger?

        Arun C Murthy added a comment -

As I've said before, MergeManager is an unholy alliance of the Shuffle & Merge phases; we really should separate them, not enshrine them further by making MergeManager a public API.

        Arun C Murthy added a comment -

I'll admit I'm really confused - the jira says 'make reduce-side merge pluggable', and then the patch makes changes to MapOutput, tries to make MergeManager a resolver, etc.

OTOH, there is a concrete alternative: introduce a Merger abstraction which solves the use-cases described, i.e. limit-N, hash-join, etc.

Why are we not doing the straight-fwd thing of solving what the jira intends to do, i.e. make merge pluggable?

        Are we trying to solve something else here? If so, can we at least articulate what we are trying to solve? Thanks.

        Alejandro Abdelnur added a comment -

        Arun, thanks for pointing out the wrong summary/description. I think we ended up with it when creating the subtasks for MAPREDUCE-2454 based on previous ideas.

I've just updated it to reflect the scope of the current patch.

        Arun C Murthy added a comment -

        Alejandro Abdelnur Thanks for clarifying that this is indeed a different endeavor.

        Can you please help me understand what the goals now are? What are we hoping to achieve with this re-factor? Is this something specific to Syncsort's needs? I'm not against it, I just want to understand why we are doing this re-factor and how it will be used by Syncsort. Thanks.

        Alejandro Abdelnur added a comment -

The goal is to be able to write alternate implementations of the Shuffle, like the ones Asokan and Jerry are trying to do, while reusing functionality provided by the Hadoop default implementation. For example, being able to leverage the logic in the default shuffle (i.e. the fetchers) while replacing the merge logic and the merge resource allocation logic driven by the MergeManager. While some of this logic replacement could be done at the Merge level as you suggested, others, like MapOutput allocation, cannot be done there, as this is driven by the MergeManager. The refactoring of MapOutput allocation from a struct to classes permits alternate implementations that can reuse memory buffers, thus reducing JVM heap allocation.
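The struct-to-classes refactor described above can be pictured with a minimal sketch. All class and method names here are invented for illustration, not the actual Hadoop types: the idea is simply that once MapOutput is an abstract class whose subclasses own their storage, a plugin can fill a pre-allocated, reusable buffer instead of allocating a fresh byte[] per map output.

```java
import java.io.IOException;
import java.io.InputStream;

// Illustrative only: MapOutput as an abstract class whose subclasses
// decide where the shuffled bytes land.
abstract class MapOutputSketch {
    abstract void shuffle(InputStream in, long length) throws IOException;
}

// A subclass that writes into a pooled buffer at a fixed offset,
// reducing JVM heap churn across map outputs.
class PooledMemoryMapOutput extends MapOutputSketch {
    private final byte[] buffer; // handed out from a reusable pool
    private final int offset;    // where this output's bytes start
    private int filled;

    PooledMemoryMapOutput(byte[] pooledBuffer, int offset) {
        this.buffer = pooledBuffer;
        this.offset = offset;
    }

    @Override
    void shuffle(InputStream in, long length) throws IOException {
        int read = 0;
        while (read < length) {
            int n = in.read(buffer, offset + read, (int) (length - read));
            if (n < 0) {
                throw new IOException("premature end of map output stream");
            }
            read += n;
        }
        filled = read;
    }

    int bytesFilled() {
        return filled;
    }
}
```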

        Alejandro Abdelnur added a comment -

        Assuming all concerns/questions have been addressed, can we move forward and commit the latest patch?

        Arun C Murthy added a comment -

        The goal is to be able to write alternate implementations of the Shuffle

        Alejandro - it seems like you understand something about the use-case that I don't. Maybe you & Asokan have had a private chat?

What are the use-cases for alternate implementations of the Shuffle? As Chris also mentioned, with MAPREDUCE-4049 we already allow alternate implementations of Shuffle; is this redundant then?

While some of this logic replacement could be done at the Merge level as you suggested, others, like MapOutput allocation, cannot be done there, as this is driven by the MergeManager.

        So, a combination of MapOutput re-factor and Merger interface should suffice?

IAC, what are the use-cases for alternate implementations of MapOutput? Or is the MapOutput re-factor merely a code-hygiene issue?


        I'm not trying to be difficult here. But, I feel like I just don't understand the use-case. So, I'd appreciate if we could focus on concrete use-cases for the plugin. I admit I still am having a hard time understanding why we need this complexity.

        Thanks.

        Mariappan Asokan added a comment -

        Hi Arun,
MAPREDUCE-4049 expects the plugin implementer to implement the shuffle from scratch. With the default HTTP shuffle implementation being robust and secure, it is possible to reuse it in the majority of situations.

        The alternate implementation of MapOutput can be left to the plugin implementer. For example, it can be optimized to use less JVM memory and minimize Java garbage collection.

        Some of the concrete use cases for the plugin are: hash aggregation, hash join, limit-N query, etc.

        Thanks.

        – Asokan

        Chris Douglas added a comment -

Asokan, the concern is that breaking an API, even if it's marked unstable, is an incompatible change. Since the pluggable shuffle is particularly useful for frameworks, breaking this contract could require patching/validation/rewrite of plugin and optimizer code in projects that invest in it (Hive, Pig, etc.). Moreover, if we wanted to change the default Shuffle to a different implementation, then user/framework code would perform badly - or break - unless we exposed this implementation-specific mechanism in the new impl. So it's fair to press for use cases, to ensure the abstraction is sufficient and could apply to most Shuffle implementations.

        Personally, I'm ambivalent about exposing this as an API and am +1 on the patch overall (mostly because I like the MapOutput refactoring). The user can always configure the current Shuffle, which is exactly how frameworks would handle this until they port/specialize their efficient MergeManager plugin.

        As a compromise, would it make sense to just add a protected createMergeManager method to the Shuffle? The user still needs to configure their custom Shuffle impl now, but that's better than the inevitable future where they configure both. It also makes its tie to this implementation explicit.
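The compromise being proposed can be sketched in a few lines. This is a minimal, hypothetical illustration - the names and signatures are not the real Hadoop ones - showing how a protected factory method lets a Shuffle subclass swap in its own merge logic without any new config key:

```java
// Illustrative merge-manager contract; a single method stands in for
// the real interface so the factory pattern is visible.
interface MergeManagerSketch {
    String describe();
}

class ShuffleSketch {
    // Subclasses override this instead of configuring a separate plugin class.
    protected MergeManagerSketch createMergeManager() {
        return () -> "default sort-merge";
    }

    String run() {
        // The shuffle drives whatever merge manager the factory produced.
        return createMergeManager().describe();
    }
}

// A custom shuffle reuses everything else and replaces only the merge.
class HashJoinShuffle extends ShuffleSketch {
    @Override
    protected MergeManagerSketch createMergeManager() {
        return () -> "hash-join merge";
    }
}
```

The design point is the explicit tie: the hook lives on this Shuffle implementation rather than in a global config, so configuring a custom Shuffle is the single switch the user flips.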

        Alejandro Abdelnur added a comment -

        Chris, are you suggesting?

        • remove the MergeManagerPlugin interface
        • introduce a protected createMergeManager() in the Shuffle class to instantiate (via new) & initialize the existing MergeManager.
        Mariappan Asokan added a comment -

        Hi Arun,
        I will think about your suggestion to make the Merger class pluggable and post my findings for different use cases.

        – Asokan

        Mariappan Asokan added a comment -

        Hi Chris,
        I will work on creating a real working plugin for the use cases to show that the proposed API is sufficient to handle them.

        – Asokan

        Mariappan Asokan added a comment -

        Hi Alejandro,
        If the MergeManagerPlugin is to be removed, it should be possible to extend the framework's MergeManager by an external implementation.

        – Asokan

        Mariappan Asokan added a comment -

        Hi Alejandro,
I meant to ask whether it is okay to make the existing MergeManager extendable.

        – Asokan

        Mariappan Asokan added a comment -

        Hi Arun,
        I will try to explain a simple use case of an external implementation of merge on the reduce side. Let us say this merge implementation has some fixed area of memory (Java byte array) allocated to store the shuffled data. This may be done to avoid frequent garbage collection by JVM or for better processor cache efficiency.

Looking at the methods in the Merge class, they accept input to the merge either as disk files (an array of Path objects) or as memory segments (a list of Segment objects). The former is not suitable since the merge is done in memory first and any intermediate merged output file is under the control of the plugin implementation. The latter is not suitable because the memory for the shuffled data is not under the control of the plugin implementation.

        Ideally, if an InputStream object is available, the external implementation can read shuffled data from the stream to the fixed area of memory at a specific offset in the byte array.

With the MergeManagerPlugin, the external implementation will get the HTTP connection's InputStream object via the shuffle() method in the MapOutput object. In addition, if the merge goes through multiple passes because the memory area is limited in size, there should be some way for the Shuffle to wait until memory is released by a merge pass. There is no method in Merge for that either.

        I find that it is possible to define the interaction points between current Shuffle and MergeManager using the MergeManagerPlugin interface. The plugin interface has only three methods and it allows the external plugin to have a lot of freedom in its implementation. As a side effect, the MapOutput is also refactored.

        Hope I explained this well. If you have any questions, please let me know.

        – Asokan
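The coordination point described above - the shuffle blocking until a merge pass frees space in the fixed memory area - amounts to a guarded wait on a bounded budget. The sketch below is hypothetical (`FixedMemoryPool` is an invented name; in a real plugin this role would sit behind the merge manager's reserve callbacks):

```java
// Hypothetical sketch: fetchers call reserve() before shuffling into the
// fixed memory area and block while it is full; a merge pass calls
// release() once segments have been drained, waking the fetchers.
class FixedMemoryPool {
    private long available;

    FixedMemoryPool(long capacity) {
        this.available = capacity;
    }

    // Called by a fetcher; blocks until the requested bytes fit.
    synchronized void reserve(long bytes) throws InterruptedException {
        while (bytes > available) {
            wait(); // parked until a merge pass calls release()
        }
        available -= bytes;
    }

    // Called when a merge pass has freed memory.
    synchronized void release(long bytes) {
        available += bytes;
        notifyAll(); // wake any fetchers blocked in reserve()
    }

    synchronized long available() {
        return available;
    }
}
```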

        Alejandro Abdelnur added a comment -

        I've taken the liberty to tweak the patch a bit based the last comments in the JIRA.

        • removed pluggability via config of the MergeManager
        • Shuffle has a protected createMergeManager() method
        • MergeManager is annotated as Private
        • Kept MergeManagerPlugin interface
        • Removed MergeManagerPlugin.Context
        • MergeManagerPlugin interface annotated as Private

These changes avoid having an extra knob (the MergeManager class) in the config, and keep the MergeManager owned by the Shuffle class. The interface still allows for alternate implementations like Jerry's and Asokan's.

        Asokan, Arun, Chris?

        Chris Douglas added a comment -

        +1 Looked through it; the latest patch lgtm. Asokan, is that sufficient for your use cases? Arun?

        Very minor, optional nit: s/MergeManager/MergeManagerImpl/ and s/MergeManagerPlugin/MergeManager/. There's an argument to be made for doing the same with the ShuffleScheduler while we're at it, but neither of these are blocking, IMO.

        Mariappan Asokan added a comment -

        Hi Chris,
        Thanks for your quick feedback. I looked at the patch. It has one minor nit. The createMergeManager method should take ShuffleConsumerPlugin.Context object. I will go over it one more time, work out the change, run tests, and post the patch shortly.

        Thanks.

        – Asokan

        Arun C Murthy added a comment -

        I will try to explain a simple use case of an external implementation of merge on the reduce side. Let us say this merge implementation has some fixed area of memory (Java byte array) allocated to store the shuffled data. This may be done to avoid frequent garbage collection by JVM or for better processor cache efficiency.

Asokan - this is the first time I've heard this use case, which seems like something Syncsort can take advantage of; as a consequence, I've been viewing this from the lens of 'limit-N/hash-join' merge, etc.

        In future, being clear and upfront about use-cases will obviously prevent further such confusion.


Having said that, I still feel a better approach would be to use a custom shuffle via MAPREDUCE-4049 and friends since you get more control - e.g. you might want to defer shuffle based on memory on the heap (byte[]) and memory outside the heap (JNI or DirectBuffers) for the Syncsort plugin - and clearly, the current MergeManager will not suffice for such.

        However, if this unblocks you in the short run I think the approach is fine. Thanks for the clarification. I'll take another look at the details on the patch once you upload it, but seem mostly fine to me. Thanks.

        Mariappan Asokan added a comment -

        I have uploaded the latest patch.

        Arun, thanks for your feedback. Your points are well taken. I realized that I should have posted the details of an implementation of a plugin to justify the design. Please take a look at the latest patch.

        Chris, I thought about your suggestion on naming. I came up with the following conclusions:

        • Changing the name of MergeManager at this point may cause problems if this Jira is ported to other branches (like 0.23).
        • The existing test TestMergeManager also has to be renamed.

        So, I renamed MergeManagerPlugin to MergeManagerI since it is just an interface. Is that okay?

        Thanks.

        – Asokan

        Chris Douglas added a comment -

        So, I renamed MergeManagerPlugin to MergeManagerI since it is just an interface. Is that okay?

        That's not a naming convention used anywhere else in Hadoop. I'd rather not introduce it for this case. Renaming test classes and problems with backporting are not issues.

        Alejandro Abdelnur added a comment -

Mhhh, I would prefer not to rename the MergeManager class, as fixes in trunk (after the rename) will require manual patching in branch 0.23 and maintenance releases of branch 2. Why not leave the original MergeManagerPlugin name?

        Chris Douglas added a comment -

        I'd argue:

        1. sed is not brain surgery and these issues will likely be backported to branch-2 anyway
        2. Taking the good names for interfaces, even if they're mostly private, is usually better than reserving them for implementations
        3. It shows that the MapOutput types are bound to the impl, e.g., InMemoryMapOutput calls closeInMemoryFile on the MergeManager instance, which is only part of that impl's API.

        This does the naive rename, and also renames the findbugs exclude rule for the MergeManager. I'd be OK with the patch as-is, but would prefer this.

        Alejandro Abdelnur added a comment -

        Sure, me good. thx

        Mariappan Asokan added a comment -

        Hi Chris,
Thanks for posting the proper patch. I am ashamed of myself; I should have done it without coming up with excuses. I ran all mapreduce tests and verified your patch.

        – Asokan

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12565578/M4808-0.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        -1 findbugs. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3252//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3252//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3252//console

        This message is automatically generated.

        Chris Douglas added a comment -

Wrong patch; I meant to change the target of the findbugs suppression, not delete it. However, since the reasoning includes "not likely to be subclassed" - which is exactly what this issue does - maybe we should move this to init().

        Show
        Chris Douglas added a comment - Wrong patch; meant to change the target of the findbugs suppression, not delete it. However, since the reasoning includes "not likely to be subclassed"- which is exactly what this issue does- maybe we should move this to init()
        Hide
        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12565595/M4808-1.patch
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 2 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3254//testReport/
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3254//console

        This message is automatically generated.

        Mariappan Asokan added a comment -

        Chris,
        Any shuffle plugin can subclass Shuffle but should implement its own MergeManager. Allowing one to subclass MergeManagerImpl is not within the scope of this Jira. So, I am okay with suppressing the warning for now.

        Thanks.
        – Asokan
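
The plugin split described above (a custom shuffle supplies its own MergeManager implementation rather than subclassing MergeManagerImpl) can be sketched roughly as follows. The interface and class here are simplified, hypothetical stand-ins for the real org.apache.hadoop.mapreduce.task.reduce types (the actual MergeManager methods are reserve(), waitForResource() and close(), and are LimitedPrivate/Unstable); this illustrates the division of responsibilities, not the exact Hadoop API.

```java
// Simplified stand-in for the Hadoop MergeManager interface; the real type
// lives in org.apache.hadoop.mapreduce.task.reduce and has a richer signature.
interface MergeManager<K, V> {
    // Account for a fetched map output (hypothetical simplified signature).
    void reserve(String mapAttemptId, long size);
    // Finish all merges; here it just reports how many outputs were seen.
    int close();
}

// A custom shuffle plugin implements its own MergeManager instead of
// subclassing MergeManagerImpl, per the discussion above.
class CountingMergeManager<K, V> implements MergeManager<K, V> {
    private final java.util.List<String> reserved = new java.util.ArrayList<>();

    @Override
    public void reserve(String mapAttemptId, long size) {
        reserved.add(mapAttemptId);
    }

    @Override
    public int close() {
        return reserved.size();
    }
}

public class ShuffleSketch {
    public static void main(String[] args) {
        MergeManager<String, String> mm = new CountingMergeManager<>();
        mm.reserve("attempt_0001_m_000000_0", 1024);
        mm.reserve("attempt_0001_m_000001_0", 2048);
        System.out.println(mm.close()); // prints 2
    }
}
```

In the real code, the default Shuffle wires in MergeManagerImpl; an alternate ShuffleConsumerPlugin would construct its own MergeManager in the same place.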

        Chris Douglas added a comment -

        Fair point.

        This looks ready to commit. Unless someone would like additional changes, I'll plan to push it in tomorrow.

        Alejandro Abdelnur added a comment -

        Thanks Asokan and everybody. Committed to trunk.

        Hudson added a comment -

        Integrated in Hadoop-trunk-Commit #3266 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3266/)
        MAPREDUCE-4808. Refactor MapOutput and MergeManager to facilitate reuse by Shuffle implementations. (masokan via tucu) (Revision 1436936)

        Result = SUCCESS
        tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1436936
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/dev-support/findbugs-exclude.xml
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/InMemoryMapOutput.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/InMemoryReader.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MapOutput.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManager.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManagerImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeThread.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/OnDiskMapOutput.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Shuffle.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestMergeManager.java
        Mariappan Asokan added a comment -

        Alejandro, thanks for committing this.

        Alejandro, Arun, Chris, and Tom, thanks to all of you for providing valuable comments and feedback.

        – Asokan

        Hudson added a comment -

        Integrated in Hadoop-Yarn-trunk #105 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/105/)
        MAPREDUCE-4808. Refactor MapOutput and MergeManager to facilitate reuse by Shuffle implementations. (masokan via tucu) (Revision 1436936)

        Result = SUCCESS
        tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1436936
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/dev-support/findbugs-exclude.xml
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/InMemoryMapOutput.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/InMemoryReader.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MapOutput.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManager.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManagerImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeThread.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/OnDiskMapOutput.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Shuffle.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestMergeManager.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #1294 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1294/)
        MAPREDUCE-4808. Refactor MapOutput and MergeManager to facilitate reuse by Shuffle implementations. (masokan via tucu) (Revision 1436936)

        Result = FAILURE
        tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1436936
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/dev-support/findbugs-exclude.xml
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/InMemoryMapOutput.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/InMemoryReader.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MapOutput.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManager.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManagerImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeThread.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/OnDiskMapOutput.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Shuffle.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestMergeManager.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #1322 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1322/)
        MAPREDUCE-4808. Refactor MapOutput and MergeManager to facilitate reuse by Shuffle implementations. (masokan via tucu) (Revision 1436936)

        Result = SUCCESS
        tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1436936
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/dev-support/findbugs-exclude.xml
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/InMemoryMapOutput.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/InMemoryReader.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MapOutput.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManager.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManagerImpl.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeThread.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/OnDiskMapOutput.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Shuffle.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestMergeManager.java
        Mariappan Asokan added a comment -

        Hi Alejandro, Arun, Chris, and Tom,
        Now that MAPREDUCE-4807, MAPREDUCE-4809, and MAPREDUCE-4808 have been committed, I would like to work on MAPREDUCE-4039 (sort avoidance) as two contributed plugins. One will avoid sorting on the map side, and the other on the reduce side. These plugins will open up more use cases. For example, Jerry can implement hash aggregation in the Combiner and/or in the Reducer without any overhead of the sort.

        If there is no objection from you, I will post a brief description of a design in MAPREDUCE-4039 soon. Please let me know.

        Thanks.

        – Asokan


          People

          • Assignee: Mariappan Asokan
          • Reporter: Arun C Murthy
          • Votes: 0
          • Watchers: 24