Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0.2-alpha, 0.23.5
    • Fix Version/s: 2.0.3-alpha, 0.23.6
    • Component/s: mrv2
    • Labels:
      None

      Description

      Saw an instance where the shuffle caused multiple reducers in a job to hang. It looked similar to the problem described in MAPREDUCE-3721, where the fetchers were all being told to WAIT by the MergeManager but no merge was taking place.
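
          Below is a minimal, hypothetical Java sketch of the kind of check-then-act pattern that can produce this hang; the names, sizes, and structure are illustrative assumptions, not the actual MergeManager source.

            import java.util.ArrayList;
            import java.util.List;

            // Illustrative sketch only -- not the real MergeManager.
            class ShuffleHangSketch {
              private final List<byte[]> inMemoryMapOutputs = new ArrayList<>();
              private long commitMemory = 0;
              private final long mergeThreshold = 64L * 1024 * 1024; // assumed value
              private volatile boolean mergeInProgress = false;

              // Called by a fetcher once a map output has been fully shuffled into memory.
              synchronized void closeInMemoryFile(byte[] mapOutput) {
                inMemoryMapOutputs.add(mapOutput);
                commitMemory += mapOutput.length;
                if (commitMemory >= mergeThreshold && !mergeInProgress) {
                  mergeInProgress = true;
                  startMerge(new ArrayList<>(inMemoryMapOutputs)); // hand off to the merge thread
                  inMemoryMapOutputs.clear();
                  commitMemory = 0;
                }
                // If a merge is already in progress, startMerge() is skipped here. Unless
                // something re-checks the threshold when that merge finishes, outputs
                // committed in the meantime are never merged, memory is never freed, and
                // every fetcher that later asks to reserve memory is told to WAIT forever.
              }

              private void startMerge(List<byte[]> inputs) {
                // a background merge thread would consume the inputs and eventually
                // set mergeInProgress back to false (elided)
              }
            }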

      Attachments

    1. mapreduce-4842.patch
        15 kB
        Mariappan Asokan
      2. mapreduce-4842.patch
        15 kB
        Mariappan Asokan
      3. mapreduce-4842.patch
        15 kB
        Mariappan Asokan
      4. mapreduce-4842.patch
        6 kB
        Mariappan Asokan
      5. mapreduce-4842.patch
        6 kB
        Mariappan Asokan
      6. mapreduce-4842.patch
        6 kB
        Mariappan Asokan
      7. MAPREDUCE-4842.patch
        15 kB
        Jason Lowe
      8. MAPREDUCE-4842.patch
        15 kB
        Arun C Murthy
      9. MAPREDUCE-4842.patch
        14 kB
        Jason Lowe
      10. MAPREDUCE-4842.patch
        6 kB
        Arun C Murthy
      11. MAPREDUCE-4842-2.patch
        16 kB
        Jason Lowe

        Issue Links

          Activity

          Mariappan Asokan added a comment -

          Hi Ravi,
          Thanks for the compliment. I will look at the patch for MAPREDUCE-3685 and post my comments there once I understand it completely.

          – Asokan

          Ravi Prakash added a comment -

          Hi Mariappan,

          This is a tangent to point 1. The mergeFactor is set to the configured value for IntermediateMemoryToMemoryMerger but to Integer.MAX_VALUE for InMemoryMerger and OnDiskMerger. We have to find out the rationale behind these choices.

          Thanks for all your work on the MergeManager. It is soooooo much cleaner now! Thanks much.

          Anyway, since you have been in this area of the code, I was wondering if you could please review MAPREDUCE-3685? The mergeFactor for the OnDiskMerger was wrong. For inMemoryMerger it seems to be correct (because io.sort.factor is defined as "The number of streams to merge at once while sorting files. This determines the number of open file handles."). Besides, I wonder if we really want to go into the level of detail of the number of fetched cache lines rather than just simplifying by assuming constant access to all memory. Please consider continuing the discussion there.
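
          A tiny sketch of the pattern described above; the field names and the default value are assumptions for illustration, not the exact Hadoop code.

            import org.apache.hadoop.conf.Configuration;

            // Illustrative only -- how the three mergers could end up with different merge factors.
            class MergeFactorSketch {
              void configure(Configuration conf) {
                int configuredFactor = conf.getInt("io.sort.factor", 10); // default is an assumption

                int memToMemMergeFactor = configuredFactor;  // IntermediateMemoryToMemoryMerger
                int inMemoryMergeFactor = Integer.MAX_VALUE; // InMemoryMerger: effectively unbounded
                int onDiskMergeFactor = Integer.MAX_VALUE;   // OnDiskMerger: effectively unbounded
              }
            }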

          Thanks

          Hudson added a comment -

          Integrated in Hadoop-Mapreduce-trunk #1292 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1292/)
          MAPREDUCE-4842. Shuffle race can hang reducer. Contributed by Mariappan Asokan (Revision 1425071)

          Result = FAILURE
          jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1425071
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManager.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeThread.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestMergeManager.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #1262 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1262/)
          MAPREDUCE-4842. Shuffle race can hang reducer. Contributed by Mariappan Asokan (Revision 1425071)

          Result = FAILURE
          jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1425071
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManager.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeThread.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestMergeManager.java
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-0.23-Build #471 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/471/)
          svn merge -c 1425071 FIXES: MAPREDUCE-4842. Shuffle race can hang reducer. Contributed by Mariappan Asokan (Revision 1425075)

          Result = UNSTABLE
          jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1425075
          Files :

          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManager.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeThread.java
          • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestMergeManager.java
          Hudson added a comment -

          Integrated in Hadoop-Yarn-trunk #73 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/73/)
          MAPREDUCE-4842. Shuffle race can hang reducer. Contributed by Mariappan Asokan (Revision 1425071)

          Result = SUCCESS
          jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1425071
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManager.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeThread.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestMergeManager.java
          Mariappan Asokan added a comment -

          Hi Jason,
          It was a pleasure working with all of you. I know this race condition is very hard to reproduce let alone debug. You did an excellent job. All your feedback and challenges encouraged me to find the best possible solution.

          – Asokan

          Hudson added a comment -

          Integrated in Hadoop-trunk-Commit #3149 (See https://builds.apache.org/job/Hadoop-trunk-Commit/3149/)
          MAPREDUCE-4842. Shuffle race can hang reducer. Contributed by Mariappan Asokan (Revision 1425071)

          Result = SUCCESS
          jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1425071
          Files :

          • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeManager.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/MergeThread.java
          • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestMergeManager.java
          Jason Lowe added a comment -

          Thanks, Mariappan! I committed this to trunk, branch-2, and branch-0.23.

          Jason Lowe added a comment -

          Yes, I suppose it's OK in that sense. It's oddly written to have it set the value to zero only to end up at -1 because of the finally block, but in the end that's a nit and not necessary to fix.

          +1 for the patch, will commit shortly.

          Mariappan Asokan added a comment -

          Hi Jason,
          When the exceptions happen, the thread will terminate (there is a return inside the catch blocks). It is okay if numPending ends up being -1: the method waitForMerge() will return immediately, so from the point of view of users of the class there is no problem.

          If you have any more questions, please let me know. Otherwise, I think the fix is good.

          – Asokan

          Jason Lowe added a comment -

          This still isn't quite right. If an exception occurs during the merge, numPending will be set to 0 and then decremented to -1 by the finally block. If we're going to explicitly set the value to 0 for exceptions then we shouldn't be decrementing in the finally block. Instead we can decrement in the try block after the merge completes.
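
          A hedged sketch of the run() shape being suggested here (illustrative names, not the committed MergeThread code): decrement inside the try block after a successful merge, and only reset the counter on the exception paths.

            import java.util.LinkedList;
            import java.util.List;
            import java.util.concurrent.atomic.AtomicInteger;

            // Illustrative sketch of the suggested control flow -- not the actual patch.
            abstract class MergeLoopSketch<T> extends Thread {
              private final LinkedList<List<T>> pendingToBeMerged = new LinkedList<>();
              // incremented by an (elided) startMerge() each time work is queued
              private final AtomicInteger numPending = new AtomicInteger(0);

              public void run() {
                while (true) {
                  List<T> inputs;
                  try {
                    synchronized (pendingToBeMerged) {
                      while (pendingToBeMerged.isEmpty()) {
                        pendingToBeMerged.wait();
                      }
                      inputs = pendingToBeMerged.removeFirst();
                    }
                    merge(inputs);
                    numPending.decrementAndGet(); // only after the merge completes
                  } catch (InterruptedException ie) {
                    numPending.set(0);            // shutting down: unblock any waiters
                    return;
                  } catch (Throwable t) {
                    numPending.set(0);            // merge failed: unblock waiters and exit
                    return;
                  } finally {
                    synchronized (this) {
                      notifyAll();                // wake waitForMerge() callers; no decrement here
                    }
                  }
                }
              }

              abstract void merge(List<T> inputs) throws Exception;
            }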

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12562016/mapreduce-4842.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3153//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3153//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          Hi Jason,
          Thanks for your comments. I think the race condition exists because inProgress is a boolean. I changed it to an AtomicInteger and called it numPending. There should not be any more race conditions. Please provide your feedback.
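
          A rough sketch of the AtomicInteger approach described above (illustrative field and method names, not necessarily the attached patch): each queued merge bumps the counter, the merge thread decrements it when a merge finishes, and waiters block until the counter drains to zero.

            import java.util.ArrayList;
            import java.util.LinkedList;
            import java.util.List;
            import java.util.concurrent.atomic.AtomicInteger;

            // Illustrative sketch only.
            class PendingMergeSketch<T> {
              private final LinkedList<List<T>> pendingToBeMerged = new LinkedList<>();
              private final AtomicInteger numPending = new AtomicInteger(0);

              // Producer side, e.g. called when the in-memory merge threshold is reached.
              public void startMerge(List<T> inputs) {
                numPending.incrementAndGet();      // count the merge before queueing it
                synchronized (pendingToBeMerged) {
                  pendingToBeMerged.addLast(new ArrayList<>(inputs));
                  pendingToBeMerged.notifyAll();   // wake the merge thread
                }
              }

              // Consumer side: the merge thread calls this after each completed merge.
              void mergeFinished() {
                numPending.decrementAndGet();
                synchronized (this) {
                  notifyAll();                     // wake waitForMerge() callers
                }
              }

              // Fetchers and close() wait here; there is no boolean flag to race on.
              public synchronized void waitForMerge() throws InterruptedException {
                while (numPending.get() > 0) {
                  wait();
                }
              }
            }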

          Hi Siddharth,
          I understand your concern on the time it is taking. If we fix this properly, we do not have to come back to this issue later. Jason seems to be reviewing my patch.

          Thanks.

          – Asokan

          Siddharth Seth added a comment -

          Asokan, for this specific JIRA, I'd, at least, be more comfortable with Arun/Jason's patch to fix this blocker,...

          Main intention was to get this committed faster. In terms of review time - the first patch looked simpler. If someone is doing a detailed review, I have absolutely no issues with the patch.

          Jason Lowe added a comment -

          I'm mostly OK with the latest patch except for one issue. Now that inProgress is being set after the input is queued up, we have a different kind of race. It's unlikely but theoretically possible:

          1. startMerge() queues up the input and wakes up the merging thread
          2. merging thread wakes up, completes the merge quickly, and sets inProgress to false
          3. startMerge() finally sets inProgress to true, and now we have inProgress set to true with no merge in progress.

          I'd prefer the inProgress setting in startMerge() was moved back to before the input is queued up and the wakeup occurs. There's still a race where it could blip back to false before it gets set back to true within the run() method, but that's a benign race. There's always going to be a race regarding inProgress given asynchronous producers and consumer, we just need to make the race outcomes safe.
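
          A hedged sketch of the ordering preferred here (illustrative names, not the exact patch): set the flag first, then queue the input and wake the merger.

            import java.util.ArrayList;
            import java.util.LinkedList;
            import java.util.List;

            // Illustrative sketch only.
            class StartMergeOrderingSketch<T> {
              private final LinkedList<List<T>> pendingToBeMerged = new LinkedList<>();
              private volatile boolean inProgress = false;

              public void startMerge(List<T> inputs) {
                if (!inputs.isEmpty()) {
                  inProgress = true;                 // (1) mark the merge in progress first
                  synchronized (pendingToBeMerged) {
                    pendingToBeMerged.addLast(new ArrayList<>(inputs));
                    pendingToBeMerged.notifyAll();   // (2) then queue the work and wake the merger
                  }
                  // Worst case, a fast merge flips inProgress back to false just before the
                  // merge thread sets it true again; that blip is benign because nothing is
                  // left waiting on a merge that will never run.
                }
              }
            }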

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12561978/mapreduce-4842.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3150//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3150//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          Made it more robust. Set inProgress to true at the end of startMerge() as well.

          – Asokan

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12561957/mapreduce-4842.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3149//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3149//console

          This message is automatically generated.

          Mariappan Asokan added a comment -

          Hi Jason,
          Thanks for the quick review of the patch. I put the list clearing in a synchronized block. I set inProgress to true before starting a merge. I shamelessly grabbed your unit test and incorporated it in the patch. Please take a look at it.
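
          A small sketch of the list-clearing change mentioned above (hypothetical names, not the exact patch): the pending list is only mutated while holding its monitor, so the merge thread and the cleanup path never touch it concurrently.

            import java.util.LinkedList;
            import java.util.List;

            // Illustrative sketch only.
            class ClearPendingSketch<T> {
              private final LinkedList<List<T>> pendingToBeMerged = new LinkedList<>();

              // Called from the exception/cleanup path; same lock the merge thread uses.
              void clearPending() {
                synchronized (pendingToBeMerged) {
                  pendingToBeMerged.clear();
                }
              }
            }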

          Thanks.

          – Asokan

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12561943/mapreduce-4842.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3148//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3148//console

          This message is automatically generated.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12561943/mapreduce-4842.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3147//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3147//console

          This message is automatically generated.

          Jason Lowe added a comment -

          Patch race, sorry about that Asokan. I took a look at your most recent patch; a couple of comments:

          • I see we're now clearing the lists when certain exceptions are caught, but we're not holding a lock on the list when doing so?
          • Per my previous comment, I think there is a race regarding inProgress where we can do a merge with it set to false.
          • Patch will need a unit test, feel free to grab the test from my previous patch or roll your own if you have a cleaner one in mind.
          Mariappan Asokan added a comment -

          Thanks to all of you for your comments. Thanks to Thomas for the testing. I will address your comments below:

          Siddharth Seth:

          ReduceTask logs should have the exception. I didn't look in detail,...

          I addressed this issue in the patch I am going to upload.

          Asokan, for this specific JIRA, I'd, at least, be more comfortable with Arun/Jason's patch to fix this blocker,...

          I have to respectfully disagree on this for the following reasons:

          • It did not address the root cause of the problem, namely the race condition due to the isInProgress() method. I strongly feel that it should be removed.
          • It still goes through premature merge passes before enough map outputs are shuffled to memory, resulting in a potential performance issue.
          • The patch is more complicated (it adds an unnecessary new method in the base class and forces all subclasses to implement it, even though this problem exists only for the in-memory merger).

          Jason Lowe:

          Regarding the removal of the element in the finally block, I'm not sure why we're waiting until after merging...

          Thanks for pointing this out. I realized this and in the new patch I am uploading, I took care of that.

          Speaking of the finally block, I'm also curious if we really want to only notify others of the merge completing if there are no further merges pending. Arguably we should wake them up as soon as any merge completes, as it did previously, because usedMemory should have been lowered during the merge and would allow more shuffle data to be fetched into memory. Waiting until there are no more merges pending means we can't pipeline the shuffle data fetch with ongoing merges if all the fetchers are waiting for the merge so memory can be freed. Waking up waiters on any merge completion means we don't need to lock pendingToBeMerged at all in the finally block (once we also make the change suggested above) and the finally block reverts to what it was originally.

          Good point. However, if we revert the finally block, it may cause the call to close() to return prematurely even though one more merge is pending. Any performance gain from the additional fetches is moot: first, the pending merge would have used up most of the memory; also, the merge is data-dependent and may not free up memory sooner, which would result in stalled map outputs.

          I really appreciate all your comments and the testing. I will upload the new patch shortly.

          – Asokan

          Jason Lowe added a comment -

          In the interest of trying to push this forward faster, here's another version of Asokan's patch with the unit test from the original patch added. I also implemented the removeFirst() instead of getFirst() change, and I fixed one more issue. The last patch had a race regarding inProgress where startMerge() could set it to true, but a merge could be completing simultaneously and smash it back to false. Then we'd run a merge without having inProgress as true during the merge, which is Not Good when it comes to getting the fetchers to try to wait when they should.

          This patch does not implement the pipelining idea yet since the performance tests indicate that it might not be necessary to achieve equivalent performance. Implementing it should be fairly straightforward. For example, we could add a volatile mergeCount field that is incremented when merges complete. waitForMerge() would cache the value in a local on entry and return when either inProgress is false or mergeCount has changed (i.e.: we are waiting for any active merge to complete, not all active merges).
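
          A hedged sketch of the pipelining idea outlined above (not implemented in the patch; names are illustrative): waiters remember the merge count on entry and return as soon as any merge completes.

            // Illustrative sketch only.
            class MergeWaitSketch {
              private volatile boolean inProgress = false;
              private volatile int mergeCount = 0;

              // Called by the merge thread each time a merge finishes.
              synchronized void onMergeComplete(boolean morePending) {
                mergeCount++;
                inProgress = morePending;
                notifyAll();
              }

              // Fetchers block here; they resume after any single merge completes, so freed
              // memory can be refilled while further merges are still running.
              synchronized void waitForMerge() throws InterruptedException {
                int countOnEntry = mergeCount;
                while (inProgress && mergeCount == countOnEntry) {
                  wait();
                }
              }
            }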

          Jason Lowe added a comment -

          Regarding the removal of the element in the finally block, I'm not sure why we're waiting until after merging to remove the element from the list. The list is private, nobody should be trying to examine/walk it mid-merge, and it seems much simpler to dequeue the element being processed before processing rather than waiting until the end. Basically pendingToBeMerged.getFirst() becomes pendingToBeMerged.removeFirst() and we don't need to remember to remove it in the finally block.

          Speaking of the finally block, I'm also curious if we really want to only notify others of the merge completing if there are no further merges pending. Arguably we should wake them up as soon as any merge completes, as it did previously, because usedMemory should have been lowered during the merge and would allow more shuffle data to be fetched into memory. Waiting until there are no more merges pending means we can't pipeline the shuffle data fetch with ongoing merges if all the fetchers are waiting for the merge so memory can be freed. Waking up waiters on any merge completion means we don't need to lock pendingToBeMerged at all in the finally block (once we also make the change suggested above) and the finally block reverts to what it was originally.

          Alejandro Abdelnur added a comment -

          If Asokan's approach addresses the problem of this JIRA while simplifying the existing code and it does not introduce regressions, why not use it for this JIRA and improve it with follow-up JIRA(s)?

          Thomas Graves added a comment -

          I ran the latest patch through some perf tests. Gridmix and shuffle both look good with the same performance; sort is slightly faster with this patch.

          Siddharth Seth added a comment -

          You are right that such a scenario is possible. However, the fetcher thread will be waiting in waitForInMemoryMerge() or it may get stalled map output. This may mitigate the problem. I have an idea on how to eliminate this problem completely. I will verify that it will work and post it as part of the patch later. It will be simple, I promise

          Fetches can already be in progress. I did see multiple single-file merges with the patch applied; the tera-sort example that I ran ended up with 6 on-disk files to merge instead of 3 in the current implementation. I'm not sure why the Fetcher is waiting for the InMemoryMerge to complete. IAC, your latest patch likely takes care of this.

          Can you describe a scenario when this might be a problem? We can address that too.

          ReduceTask logs should have the exception. I didn't look in detail, but I believe it's caused by a notify after all merges are complete - and there's an attempt to remove an element from the finally block.

          Asokan, for this specific JIRA, I'd, at least, be more comfortable with Arun/Jason's patch to fix this blocker, with a follow-up jira to clean up the code with the patch you posted - this is assuming, of course, that there isn't a degradation in performance. The original patch isn't doing too much other than checking for whether a merge can run after the existing merge completes. It's a bigger patch, but simpler in terms of functional changes.

          Mariappan Asokan added a comment -

          I updated the patch. All the changes are in MergeManager. Here is the outline of changes:

          • Eliminated the line
            commitMemory -= size;
            

            in the unreserve() method. Rationale: the complementary method reserve() only increments usedMemory, not commitMemory. Besides, commitMemory is used only to decide when we have enough shuffled map outputs in memory to trigger an in-memory merge.

          • In closeInMemoryFile(), once an in-memory merge is submitted, commitMemory is set back to 0. Rationale: if any fetcher thread sneaks in (past the in-memory merge's wait, because the in-memory merge has not started yet), it will be allowed to shuffle data to memory if memory was freed by the in-memory merger. The value of commitMemory will be incremented from 0, so another merge will not be triggered unless the number of bytes shuffled by the sneaked-in threads is greater than or equal to mergeThreshold. This makes sure that we do not start a merge prematurely. (See the sketch after this list.)
          • Added initialization of usedMemory and commitMemory in the constructor (though this is not strictly needed, since Java zeroes out these fields by default).
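
          A compact sketch of the commitMemory handling outlined in the bullets above (illustrative names and types, not the exact attached patch).

            // Illustrative sketch only.
            class CommitMemorySketch {
              private long usedMemory = 0;   // bytes reserved by fetchers
              private long commitMemory = 0; // bytes fully shuffled since the last submitted merge
              private final long mergeThreshold;

              CommitMemorySketch(long mergeThreshold) {
                this.mergeThreshold = mergeThreshold;
              }

              // Only usedMemory is released; commitMemory is left alone, since reserve()
              // never incremented it (first bullet).
              synchronized void unreserve(long size) {
                usedMemory -= size;
              }

              // Once a merge is submitted, start counting from zero so fetchers that sneak
              // in must accumulate a full mergeThreshold of new data before the next merge
              // is triggered (second bullet).
              synchronized void closeInMemoryFile(long outputSize) {
                commitMemory += outputSize;
                if (commitMemory >= mergeThreshold) {
                  startInMemoryMerge();
                  commitMemory = 0;
                }
              }

              private void startInMemoryMerge() {
                // submit the accumulated in-memory outputs to the merge thread (elided)
              }
            }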

          Please test this patch for any performance regression.

          Thanks.

          – Asokan

          Mariappan Asokan added a comment -

          Hi Jason, Thomas, and Siddharth,
          Thanks for running the tests and reporting your findings. My patch was intended to eliminate the race condition due to the isInProgress() method in MergeThread. One cannot check the state of a thread and then take an action based on the state because the state might change before the action is taken. The state checking and action should be atomic. So I came up with a solution to get rid of that method.

          I was not intending to change the existing logic on when an in-memory merge is triggered. Also, I was not expecting any performance improvement or degradation due to this change. There might be very little improvement in the overall performance due to the elimination of 'synchronized' calls. However, it simplifies the code.

          Now going to Siddharth's comment:

          Asokan, one issue I can see with the patch - while a merge is in progress, every completed fetch will end up generating a single element list for the merger - effectively getting written out to its own file.

          You are right that such a scenario is possible. However, the fetcher thread will be waiting in waitForInMemoryMerge() or it may get stalled map output. This may mitigate the problem. I have an idea on how to eliminate this problem completely. I will verify that it will work and post it as part of the patch later. It will be simple, I promise

          Siddharth, you state:

          Also, there's a couple of exceptions from MergeThread.run during shutdown, which would need to be addressed, if this approach is being taken.

          Can you describe a scenario when this might be a problem? We can address that too.

          Once again, thanks to all of you.

          – Asokan

          Thomas Graves added a comment -

          I ran some shuffle and sort tests, and on the shuffle test job times were ~90 seconds worse with Asokan's patch than without, with that time being taken by the reducers. The sort test showed a wide variation: one run took 680 seconds, the next 770 seconds. I don't normally see that much difference between runs. The runs without the patch were 680 and 700 seconds.

          Siddharth Seth added a comment -

          Asokan, one issue I can see with the patch - while a merge is in progress, every completed fetch will end up generating a single element list for the merger - effectively getting written out to its own file. Once the initial merge nears completion - and the inputs are closed, commitMemory will go back down and allow the next merge list to be larger. For bigger jobs - this will likely hurt performance. Controlling the number of files per merge-list as well as potentially avoiding the last merge seem to be required.
          Also, there's a couple of exceptions from MergeThread.run during shutdown, which would need to be addressed, if this approach is being taken.
          Not sure about what causes the slightly improved performance (would expect it to be a little worse in certain situations) - it does remove some of the synchronized checks on merger.isInProgress and in the individual mergers - don't think that explains it though. Any thoughts on what would explain the difference in performance ?

          One idea I wanted to try is to change the patch to only trigger a merge after a merge completes if we're convinced there are no outstanding fetchers that would trigger it later (e.g.: only trigger if merge conditions are met and commitMemory == usedMemory, IIRC).

          That could also prevent a last merge from being written to disk on completion of the last fetcher. Right now, this seems to depend on the status of the merger and the occupied memory.
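
          To make the proposed condition concrete, here is a minimal sketch; the field names (commitMemory, usedMemory, mergeThreshold) are assumptions taken from the discussion above, not the actual MergeManager code:

            // Illustrative only: restart a merge after one completes only when the usual
            // threshold is met and no fetcher still holds reserved-but-uncommitted memory.
            class MergeTriggerSketch {
              private long commitMemory;   // bytes committed by finished fetches
              private long usedMemory;     // bytes reserved by all fetches, finished or not
              private long mergeThreshold; // configured in-memory merge threshold

              synchronized boolean shouldStartMergeAfterCompletion() {
                return commitMemory >= mergeThreshold && commitMemory == usedMemory;
              }
            }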

          Thomas Graves added a comment -

          I've run Asokan's patch through Gridmix also. I ran it on 2 different clusters - 300 nodes (older Harpertown hardware) and 60 nodes (Westmere hardware). 2 different trace files were run, one with 1200 jobs and one with ~200 jobs. Both show a slight improvement similar to Alejandro's results: 1-2 minutes better with Asokan's patch. I'm waiting for the original patch from Arun and Jason to run through as a double check against the results we were seeing before (where it was worse).

          I am also going to run his patch through a few of the other benchmark suites to make sure Gridmix didn't miss something.

          Jason Lowe added a comment -

          Thanks for the gridmix stats, Alejandro. Sorry, I've been swamped with other issues and haven't had much time to spend on this. We plan on running Asokan's patch through a 350 machine gridmix run soon, hopefully today.

          Alejandro Abdelnur added a comment -

          I had a cluster set up with trunk to run some gridmix tests (for MAPREDUCE-2454), and before shutting it down I did a couple of runs using Asokan's patch.

          35-machine cluster. The trace had ~1000 jobs. I did 2 runs with trunk and 2 runs with the patch.

          TRUNK:

          Time spent in simulation: 43mins, 31sec
          Time spent in simulation: 41mins, 28sec

          MAPREDUCE-4842

          Time spent in simulation: 39mins, 30sec
          Time spent in simulation: 39mins, 25sec

          It would be worth looking at whether it could be modified to control the number of merges being created.

          Mariappan Asokan added a comment -

          Hi Jason,
          I have uploaded the patch with the caveat that it has not been put through stress testing.

          You stated the following:

          We ran this patch through gridmix, and there are some indications it may negatively affect the performance of shuffle/merge for reducers. Not quite sure why, yet, as I haven't had time to investigate. Maybe since this patch checks for starting merges more often we end up starting merges too early and end up creating more work than if we wait for a fetcher to commit first?

          1. Did you look at the log files to see the messages logged from the startMerge() method in MergeThread? It tries to merge at most mergeFactor map outputs at a time. Do you see any differences in the messages with and without your patch, since you are guessing that "we end up starting merges too early"?
          2. This is a tangent to point 1. The mergeFactor is set to the configured value for IntermediateMemoryToMemoryMerger but to Integer.MAX_VALUE for InMemoryMerger and OnDiskMerger. We have to find out the rationale behind these choices.
          3. You are right that in my patch I did not make any change to the logic on when to start the merge.

          Let us compare the logs (with and without the patches) and go from there for any conclusions.

          Thanks for sharing the information.

          – Asokan

          Jason Lowe added a comment -

          If the problem is that we are creating too many merges, it seems Asokan's approach would have the same issue, no? We would schedule merges immediately upon hitting the commit threshold since it wouldn't delay if a merge was in progress; rather, it would queue up the next merge chunk on the list. Or maybe I'm misunderstanding the proposed change?

          Asokan, please post a patch. It would help ensure we all are on the same page. Thanks!

          Alejandro Abdelnur added a comment -

          Jason, thanks for the detailed explanation. On the degradation, maybe it would be worth looking at Asokan's approach to see whether it is correct and does not impact performance. What do you think?

          Jason Lowe added a comment -

          Unfortunately no, I don't have an easy repro case. This is something I noticed happened to a job someone was running on one of our clusters. It's a race condition between fetchers and merging, and I'm not sure even with the same cluster config and job it will easily reproduce.

          We ran this patch through gridmix, and there are some indications it may negatively affect the performance of shuffle/merge for reducers. Not quite sure why, yet, as I haven't had time to investigate. Maybe since this patch checks for starting merges more often we end up starting merges too early and end up creating more work than if we wait for a fetcher to commit first? One idea I wanted to try is to change the patch to only trigger a merge after a merge completes if we're convinced there are no outstanding fetchers that would trigger it later (e.g.: only trigger if merge conditions are met and commitMemory == usedMemory, IIRC).

          Alejandro Abdelnur added a comment -

          Jason, do you happen to have a cluster config and job that easily reproduces the problem? Thx

          Mariappan Asokan added a comment -

          Hi Jason, Arun, and Alejandro,
          I came up with a simpler solution to solve this nasty problem. Instead of a single list of inputs in MergeThread, we can keep a FIFO list of these lists. This will make sure that more than one merge can be pending. The run() method in MergeThread will keep pulling map output lists out of the FIFO list and merging them (this is a typical producer-consumer scenario).

          I will outline the changes below:

          In MergeThread,

          • A LinkedList<List<T>> member (pendingToBeMerged) is added and the member inputs is removed.
          • The isInProgress() method is removed.
          • The startMerge() method will no longer be synchronized. It will add the passed list to the tail of pendingToBeMerged and it will notifyAll() on the monitor of pendingToBeMerged.
          • The run() method will sit in a loop. So long as there is an item (a list of map outputs) to be consumed, it will consume (merge) the item and remove it from pendingToBeMerged. If pendingToBeMerged has no more items, it will notifyAll() on the object's monitor after setting inProgress to false.

          In MergeManager,

          • All calls to isInProgress() are removed.
          • Unnecessary synchronized blocks on merge thread objects are removed, since the methods that contain them are themselves synchronized.

          I created a patch with the above changes and tested it on my laptop. The mapreduce tests seem to run without any problem. However, I do not claim that it is completely tested. It has to go through the rigorous testing that Jason did.

          If you are interested in taking a look at the patch, I will post it to this Jira. I welcome your questions and suggestions on the idea of the patch.

          – Asokan
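
          To illustrate the producer-consumer structure outlined above, here is a rough sketch; only the pendingToBeMerged name follows the description, the merge body and shutdown handling are placeholders:

            import java.util.LinkedList;
            import java.util.List;

            // Sketch only: startMerge() enqueues a batch of inputs and run() drains the
            // queue, so a merge request arriving while another merge runs is never lost.
            abstract class MergeThreadSketch<T> extends Thread {
              private final LinkedList<List<T>> pendingToBeMerged = new LinkedList<List<T>>();
              private volatile boolean closed = false;

              // Producer side: called when enough map outputs have been committed.
              public void startMerge(List<T> inputs) {
                synchronized (pendingToBeMerged) {
                  pendingToBeMerged.addLast(inputs);
                  pendingToBeMerged.notifyAll();
                }
              }

              public void close() {
                closed = true;
                synchronized (pendingToBeMerged) {
                  pendingToBeMerged.notifyAll();
                }
              }

              // Consumer side: merge one pending batch at a time, in FIFO order.
              @Override
              public void run() {
                while (true) {
                  List<T> inputs;
                  synchronized (pendingToBeMerged) {
                    while (pendingToBeMerged.isEmpty() && !closed) {
                      try {
                        pendingToBeMerged.wait();
                      } catch (InterruptedException ie) {
                        return;
                      }
                    }
                    if (pendingToBeMerged.isEmpty()) {
                      return; // closed and nothing left to merge
                    }
                    inputs = pendingToBeMerged.removeFirst();
                  }
                  merge(inputs); // the actual in-memory or on-disk merge goes here
                }
              }

              protected abstract void merge(List<T> inputs);
            }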

          Jason Lowe added a comment -

          Thanks for the reviews, Alejandro and Arun. I updated the patch to address Alejandro's comment and also added a comment clarifying why the merge callback occurs outside of the lock and after inProgress is cleared per a side discussion with Arun.

          Alejandro Abdelnur added a comment -

          One minor nit: the scope of the exceptionReporter instance variable has been changed from private to protected for testing purposes. It should be package-private instead. Preferably, we should add a getter method instead, package-private (it could be annotated with Guava's @VisibleForTesting annotation). Other than that it looks good to me.
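
          For example, the suggested accessor might look roughly like this (a sketch only; the stand-in ExceptionReporter interface is defined here just to keep the example self-contained):

            import com.google.common.annotations.VisibleForTesting;

            // Keep the field private and let tests reach it through a package-private getter.
            class MergeManagerSketch {
              interface ExceptionReporter { void reportException(Throwable t); } // stand-in for the real interface

              private final ExceptionReporter exceptionReporter;

              MergeManagerSketch(ExceptionReporter reporter) {
                this.exceptionReporter = reporter;
              }

              @VisibleForTesting
              ExceptionReporter getExceptionReporter() {
                return exceptionReporter;
              }
            }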

          Arun C Murthy added a comment -

          Jason, nice unit test! Thanks!

          I've modified it a little to have 2 barriers (mergeStart and mergeComplete) rather than using the same one 4 times (it confused me a lot when I was reviewing it).

          Other than that, it looks great. +1

          Also, if you don't mind, I'll assign the jira to you - since you've done all the heavy lifting and deserve way more credit than I do. Thanks again!

          Jason Lowe added a comment -

          Updated the patch to add a test case and rename checkAndRestartMerge to onSuccessfulMerge.

          Jason Lowe added a comment -

          I think this approach will work. One nit is we may want to rename checkAndRestartMerge() to something like onSuccessfulMerge() since that's a more general concept and accurately reflects when the method will be called.

          Arun C Murthy added a comment -

          Great catch Jason! Thanks!

          It seems like we are missing a hook in MergeThread.run to re-check the condition and trigger another merge at the end of the merge itself.

          Here is an illustrative patch.

          Thoughts?
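
          The patch itself is attached to the JIRA; purely to illustrate the shape of the hook being described, here is a sketch (the onSuccessfulMerge name follows the later discussion in this thread, everything else is an assumption):

            import java.util.List;

            // Sketch only, not the attached patch: after each merge finishes, call back so
            // the merge condition can be re-checked even if no new fetch completes.
            abstract class MergeThreadWithHook<T> extends Thread {
              @Override
              public void run() {
                while (true) {
                  List<T> inputs = waitForNextBatch(); // blocks until startMerge() supplies inputs; null on shutdown
                  if (inputs == null) {
                    return;
                  }
                  merge(inputs);
                  onSuccessfulMerge(); // the missing hook: re-check whether another merge should start now
                }
              }

              protected abstract List<T> waitForNextBatch();
              protected abstract void merge(List<T> inputs);
              protected abstract void onSuccessfulMerge();
            }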

          Jason Lowe added a comment -

          Here's the sequence of events that I believe led to the hang during shuffle. See MergeManager for context of variable references.

          1. Fetchers started fetching data
          2. Enough data finishes transferring to reach the commitMemory threshold and an in-memory merge starts
          3. While the merge takes place some of the output data is freed before the merge completes, lowering commitMemory and usedMemory which allows more data to be fetched
          4. Eventually we try to fetch too much data because usedMemory exceeds memoryLimit and further fetchers are told to WAIT
          5. All of the outstanding fetches complete and call closeInMemoryFile, but we don't start a merge because the previous merge is still marked in progress
          6. Merge completes, allowing a new merge to be started on the next closeInMemoryFile call
          7. With no outstanding fetches and no new fetches allowed, we never call closeInMemoryFile again and never start the next merge
          8. With no merge in progress and therefore nothing to wait upon, fetcher threads proceed to pummel the MergeManager asking for merge data reservations that are never given, and the reducer log grows rather rapidly
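
          As a rough illustration of the window described in steps 5-7, a simplified sketch (the field and method names here are simplifications, not the actual MergeManager code):

            // Simplified sketch of the race: a merge is only started from closeInMemoryFile(),
            // and only when no merge is already running, so the final batch of committed
            // outputs can be stranded once every fetcher has been told to WAIT.
            class InMemoryMergeRaceSketch {
              private long commitMemory;
              private long mergeThreshold;
              private boolean mergeInProgress;

              synchronized void closeInMemoryFile(long size) {
                commitMemory += size;
                // Step 5: a merge is in progress, so nothing starts here - and no later
                // call will arrive once all remaining fetchers are told to WAIT.
                if (commitMemory >= mergeThreshold && !mergeInProgress) {
                  mergeInProgress = true;
                  commitMemory = 0; // hand the committed outputs to the in-memory merger...
                }
              }

              synchronized void mergeFinished() {
                // Step 6: the merge completes, but with no further closeInMemoryFile()
                // calls (step 7) the condition above is never re-evaluated, so the shuffle hangs.
                mergeInProgress = false;
              }
            }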

            People

            • Assignee: Mariappan Asokan
            • Reporter: Jason Lowe
            • Votes: 0
            • Watchers: 11
