Hadoop Common
  1. Hadoop Common
  2. HADOOP-3326

ReduceTask should not sleep for 200 ms while waiting for merge to finish

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.18.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Changed fetchOutputs() so that LocalFSMerger and InMemFSMergeThread threads are spawned only once. The thread gets notified when something is ready for merge. The merge happens when thresholds are met.

      Description

      Currently the merge code in Reduce task does:

                  // Wait for the on-disk merge to complete
                  while (localFSMergeInProgress) {
                    Thread.sleep(200);
                  }
                  
                  //wait for an ongoing merge (if it is in flight) to complete
                  while (mergeInProgress) {
                    Thread.sleep(200);
                  }
      
      1. hadoop-3326_4.patch
        25 kB
        Mahadev konar
      2. 3326_3.patch
        25 kB
        Sharad Agarwal
      3. 3326_2.patch
        22 kB
        Sharad Agarwal
      4. 3326_1.patch
        21 kB
        Sharad Agarwal

        Activity

        Hide
        Amar Kamat added a comment -

        Owen,
        This is a one time wait in the lifetime of the reducer. 400ms seems insignificant as compared to the total runtime of the reducer. What is the concern here?

        Show
        Amar Kamat added a comment - Owen, This is a one time wait in the lifetime of the reducer. 400ms seems insignificant as compared to the total runtime of the reducer. What is the concern here?
        Hide
        Devaraj Das added a comment -

        Amar, this is more of a style issue rather than anything else. I'll submit a patch soon.

        Show
        Devaraj Das added a comment - Amar, this is more of a style issue rather than anything else. I'll submit a patch soon.
        Hide
        Sharad Agarwal added a comment -

        Attaching a patch. It does following:-
        1. LocalFSMerger and InMemFSMergeThread threads are spawned only once in fetchOutputs() method. The thread gets notified when something is ready for merge. The merge happens when thresholds are met.

        2. Instead of sleeping (in loop) to wait for finishing the merger threads, now it does join.

        3. The in memory merge code was duplicate at 2 places, now the common code has been moved to doInMemMerge() method.

        Show
        Sharad Agarwal added a comment - Attaching a patch. It does following:- 1. LocalFSMerger and InMemFSMergeThread threads are spawned only once in fetchOutputs() method. The thread gets notified when something is ready for merge. The merge happens when thresholds are met. 2. Instead of sleeping (in loop) to wait for finishing the merger threads, now it does join. 3. The in memory merge code was duplicate at 2 places, now the common code has been moved to doInMemMerge() method.
        Hide
        Amar Kamat added a comment -

        Some comments
        1) Make copiersDone volatile.
        2) I am not sure if changing the ordering of triggering InMem merge and removing it from the neededOutputs is safe. Sharad, can you plz verify this?
        3) Remove extra newlines.
        4) Since there is only one thread we can use notify() instead of notifyAll()
        5) The In-Mem merge trigger condition was

        Condition #1 (Before)
        !mergeInProgress && 
        (inMemFileSys.getPercentUsed() >= MAX_INMEM_FILESYS_USE || 
                       (mergeThreshold > 0 && 
                        inMemFileSys.getNumFiles(MAP_OUTPUT_FILTER) >= 
                        mergeThreshold))&&
                      mergeThrowable == null
        

        was changed to

        Condition #2 (After)
        mergeThreshold<0 && inMemFileSys.getNumFiles(MAP_OUTPUT_FILTER) < mergeThreshold
        

        Since condition #1 was the triggering condition and condition #2 is used to wait for the triggering condition, condition #1 = ~ condition #2. Hence it should be

         inMemFileSys.getPercentUsed() < MAX_INMEM_FILESYS_USE &&  (mergeThreshold <= 0 || inMemFileSys.getNumFiles(MAP_OUTPUT_FILTER) < mergeThreshold) 
        

        Or am I missing something?
        BTW the approach seems fine.

        Show
        Amar Kamat added a comment - Some comments 1) Make copiersDone volatile. 2) I am not sure if changing the ordering of triggering InMem merge and removing it from the neededOutputs is safe. Sharad, can you plz verify this? 3) Remove extra newlines. 4) Since there is only one thread we can use notify() instead of notifyAll() 5) The In-Mem merge trigger condition was Condition #1 (Before) !mergeInProgress && (inMemFileSys.getPercentUsed() >= MAX_INMEM_FILESYS_USE || (mergeThreshold > 0 && inMemFileSys.getNumFiles(MAP_OUTPUT_FILTER) >= mergeThreshold))&& mergeThrowable == null was changed to Condition #2 (After) mergeThreshold<0 && inMemFileSys.getNumFiles(MAP_OUTPUT_FILTER) < mergeThreshold Since condition #1 was the triggering condition and condition #2 is used to wait for the triggering condition, condition #1 = ~ condition #2. Hence it should be inMemFileSys.getPercentUsed() < MAX_INMEM_FILESYS_USE && (mergeThreshold <= 0 || inMemFileSys.getNumFiles(MAP_OUTPUT_FILTER) < mergeThreshold) Or am I missing something? BTW the approach seems fine.
        Hide
        Sharad Agarwal added a comment -

        Attaching the patch. Fixed following
        >> 1) Make copiersDone volatile.
        on second thoughts this seems to be not required. Thread is interrupted for coming out of the wait condition.
        >> 2) I am not sure if changing the ordering of triggering InMem merge and removing it from the neededOutputs is safe. Sharad, can you plz verify this?
        keeping the same order
        >> 3) Remove extra newlines.
        done
        >> 4) Since there is only one thread we can use notify() instead of notifyAll()
        done
        >> 5) The In-Mem merge trigger condition was
        fixed the condition

        Ran the local sort. Both the merge threads looks to be working fine.

        Show
        Sharad Agarwal added a comment - Attaching the patch. Fixed following >> 1) Make copiersDone volatile. on second thoughts this seems to be not required. Thread is interrupted for coming out of the wait condition. >> 2) I am not sure if changing the ordering of triggering InMem merge and removing it from the neededOutputs is safe. Sharad, can you plz verify this? keeping the same order >> 3) Remove extra newlines. done >> 4) Since there is only one thread we can use notify() instead of notifyAll() done >> 5) The In-Mem merge trigger condition was fixed the condition Ran the local sort. Both the merge threads looks to be working fine.
        Hide
        Sharad Agarwal added a comment -

        trying to run hudson

        Show
        Sharad Agarwal added a comment - trying to run hudson
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12382560/3326_2.patch
        against trunk revision 661918.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2537/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2537/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2537/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2537/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12382560/3326_2.patch against trunk revision 661918. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2537/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2537/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2537/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2537/console This message is automatically generated.
        Hide
        Sharad Agarwal added a comment -

        The failed tests are not related to this patch. The failure happened due to timeout.
        For both the failed test cases, the log message is:
        junit.framework.AssertionFailedError: Timeout occurred. Please note the time in the report does not reflect the time until the timeout.

        Test cases have not been written for this as the existing tests will verify the correctness of the results.

        Show
        Sharad Agarwal added a comment - The failed tests are not related to this patch. The failure happened due to timeout. For both the failed test cases, the log message is: junit.framework.AssertionFailedError: Timeout occurred. Please note the time in the report does not reflect the time until the timeout. Test cases have not been written for this as the existing tests will verify the correctness of the results.
        Hide
        Mahadev konar added a comment -

        retrying hudson. The tests needs to be passed before a patch can be considered for commiting.

        Show
        Mahadev konar added a comment - retrying hudson. The tests needs to be passed before a patch can be considered for commiting.
        Hide
        Mahadev konar added a comment -

        retrying hudson

        Show
        Mahadev konar added a comment - retrying hudson
        Hide
        Mahadev konar added a comment -

        the interrupt seems wrong. the lines

        //copiers are done, interrupt the waiting merge threads
        localFSMergerThread.interrupt();
        inMemFSMergeThread.interrupt();

        interrupt both the threads when the copier is done. In that case the work that these threads were doing will not be completed. Menaing theymight still be merging in memory or from local disk and data might end up disappearding if they do not complete the work they are doing. Is that correct?

        Show
        Mahadev konar added a comment - the interrupt seems wrong. the lines //copiers are done, interrupt the waiting merge threads localFSMergerThread.interrupt(); inMemFSMergeThread.interrupt(); interrupt both the threads when the copier is done. In that case the work that these threads were doing will not be completed. Menaing theymight still be merging in memory or from local disk and data might end up disappearding if they do not complete the work they are doing. Is that correct?
        Hide
        Arun C Murthy added a comment -

        Cancelling patch while Mahadev's comments are addressed...

        Show
        Arun C Murthy added a comment - Cancelling patch while Mahadev's comments are addressed...
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12382560/3326_2.patch
        against trunk revision 663370.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2579/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2579/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2579/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2579/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12382560/3326_2.patch against trunk revision 663370. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2579/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2579/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2579/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2579/console This message is automatically generated.
        Hide
        Sharad Agarwal added a comment -

        Mahadev, Thanks for the review. Have incorporated your comments. Instead of interrupt, using boolean flags to exit from the thread.
        Also, removed the last inMemory merge from fetchOutputs. Now last inMem merge is also handled by the InMemFSMergeThread.
        Tested sort+validate on 50 nodes cluster.

        Show
        Sharad Agarwal added a comment - Mahadev, Thanks for the review. Have incorporated your comments. Instead of interrupt, using boolean flags to exit from the thread. Also, removed the last inMemory merge from fetchOutputs. Now last inMem merge is also handled by the InMemFSMergeThread. Tested sort+validate on 50 nodes cluster.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12383464/3326_3.patch
        against trunk revision 663487.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2592/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2592/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2592/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2592/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12383464/3326_3.patch against trunk revision 663487. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. -1 contrib tests. The patch failed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2592/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2592/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2592/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2592/console This message is automatically generated.
        Hide
        Mahadev konar added a comment -

        +1 the patch looks good.

        Show
        Mahadev konar added a comment - +1 the patch looks good.
        Hide
        Mahadev konar added a comment -

        just edited the patch to update the boolean within the syncrhonized block.

        Show
        Mahadev konar added a comment - just edited the patch to update the boolean within the syncrhonized block.
        Hide
        Arun C Murthy added a comment -

        I agree with Mahadev's change, the test case failures are unrelated and are being addressed via separate jiras.

        Show
        Arun C Murthy added a comment - I agree with Mahadev's change, the test case failures are unrelated and are being addressed via separate jiras.
        Hide
        Arun C Murthy added a comment -

        I just committed this. Thanks, Sharad!

        Show
        Arun C Murthy added a comment - I just committed this. Thanks, Sharad!

          People

          • Assignee:
            Sharad Agarwal
            Reporter:
            Owen O'Malley
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development