Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.5.0
    • Fix Version/s: 2.6.0
    • Component/s: None
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      To minimize the number of map fetch failures reported by reducers across an NM restart, it would be nice if reducers only reported a fetch failure after trying for a specified period of time to retrieve the data.
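
      A minimal sketch of the idea, assuming a configurable retry window and illustrative
      method names (copyMapOutput, reportFetchFailure) rather than the actual patch code:

          // Sketch only: keep retrying a fetch until a deadline elapses before reporting
          // a fetch failure, so a brief NM restart does not trigger a failure storm.
          void fetchWithRetry(MapHost host) throws InterruptedException {
            final long retryTimeoutMs = 30000L;   // hypothetical tolerated NM restart window
            final long retryIntervalMs = 1000L;   // hypothetical pause between attempts
            final long retryStart = Time.monotonicNow();
            while (true) {
              try {
                copyMapOutput(host);              // hypothetical fetch of remaining map outputs
                return;                           // success: no failure is ever reported
              } catch (IOException e) {
                if (Time.monotonicNow() - retryStart >= retryTimeoutMs) {
                  reportFetchFailure(host, e);    // only report after the whole window elapsed
                  return;
                }
                Thread.sleep(retryIntervalMs);    // give the NM time to come back up
              }
            }
          }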

      Attachments

      1. MAPREDUCE-5891-v6.patch
        28 kB
        Junping Du
      2. MAPREDUCE-5891-v5.patch
        28 kB
        Junping Du
      3. MAPREDUCE-5891-v4.patch
        28 kB
        Junping Du
      4. MAPREDUCE-5891-v3.patch
        28 kB
        Junping Du
      5. MAPREDUCE-5891-v2.patch
        27 kB
        Junping Du
      6. MAPREDUCE-5891.patch
        26 kB
        Junping Du
      7. MAPREDUCE-5891-demo.patch
        14 kB
        Junping Du

          Activity

          Junping Du added a comment -

          Hi Jason Lowe, I'd like to work on this if you haven't started on it already.

          Jason Lowe added a comment -

          Thanks, Junping! I've been busy with other things and haven't started on this yet.

          I had a few ideas around either having the AM apply a "grace period" to reported errors to cover the NM restart window, or having the Fetcher only report shuffle errors after trying for a period of time that would cover the NM restart window. But I haven't had a chance to prototype those ideas yet.

          Junping Du added a comment -

          Thanks, Jason Lowe, for sharing your thoughts. I have a demo patch that works on the fetcher side. It isn't complete yet since it lacks tests, but it would be great if you could review it and provide some feedback on the approach I'm taking.
          In this demo patch:

          • add retry logic in openConnection (previously, we only had retry logic in the actual connect).
          • add retry logic in copyMapOutput. If an IOException is thrown within the tolerated NM downtime, the retry logic will rebuild the connection and skip map tasks that were already shuffled in a previous iteration.
          • refactor some code within copyFromHost() to reuse code as much as possible and keep it concise. (A rough sketch of this retry structure follows below.)
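
          A rough sketch of the structure described above, with hypothetical helper names
          (not the actual demo patch code):

              // Retry around connection setup, then around the copy itself; on an error
              // during the tolerated NM downtime, rebuild the connection and only
              // re-request the maps that have not been shuffled yet.
              openConnectionWithRetry(url);
              Set<TaskAttemptID> remaining = new HashSet<TaskAttemptID>(maps);
              try {
                copyMapOutputs(host, remaining);
              } catch (IOException e) {
                setupConnectionsWithRetry(host, remaining);  // reconnect; URL rebuilt from remaining maps
                copyMapOutputs(host, remaining);             // remaining excludes maps already shuffled
              }
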
          Jason Lowe added a comment -

          Thanks for posting a prototype, Junping! It's similar to what I was thinking. Some initial comments from a brief look at the patch:

          It could be very inefficient to re-shuffle a large map output that we already have. If we get an error and need to reconnect then we should rebuild the URL based on the remaining maps rather than keep using the original URL and throwing away data for maps we already processed. In essence we should never reconnect and ask for what we already have.

          It seems like we should only need one-level of retry rather than two. I think we only need to retry at the top-level where we try to copy map outputs. If copying map outputs fails either because of a severed connection, a connection reset/timeout, or whatever and we have not spent too much time trying to recover from errors for the same map output then we retry. As soon as we successfully shuffle a map then we reset the time (because things are working and we're making progress) and move onto the next map output. I don't think we need to retry at this level and also at the socket connection level. The upper level retry will cause us to automatically retry the lower-level socket connection, unless I'm missing something.

          We probably should use Time.monotonicNow or something similar that won't break the desired delay behavior if the clock changes.
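
          For example (a sketch, not the patch itself), the delay check could be written as:

              // Monotonic time is immune to wall-clock adjustments (NTP, manual changes),
              // so the retry window cannot be cut short or stretched by a clock jump.
              long deadline = Time.monotonicNow() + retryTimeoutMs;
              // fragile alternative: System.currentTimeMillis() + retryTimeoutMs
              boolean expired = Time.monotonicNow() >= deadline;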

          "miliseconds" should be "milliseconds"

          Junping Du added a comment -

          Thanks Jason Lowe for the quick feedback!

          If we get an error and need to reconnect then we should rebuild the URL based on the remaining maps rather than keep using the original URL and throwing away data for maps we already processed.

          That's a very good point, will fix it in a later patch.

          It seems like we should only need one-level of retry rather than two. I think we only need to retry at the top-level where we try to copy map outputs. If copying map outputs fails either because of a severed connection, a connection reset/timeout, or whatever and we have not spent too much time trying to recover from errors for the same map output then we retry.

          One level of retry sounds good. Actually, the current implementation is mostly a one-level retry; it just retries the connection and copyMapOutput separately. The reason I chose to handle the two separately is that the connection retry can hit some IOExceptions (e.g. failures in verifyConnection() that are not caused by an NM restart) that we shouldn't retry at all. Wrapping that kind of exception inside the retry seems unnecessary. Thoughts?

          We probably should use Time.monotonicNow or something similar that won't break the desired delay behavior if the clock changes. "miliseconds" should be "milliseconds"

          Also good points, will fix them in a later patch.

          Jason Lowe added a comment -

          The reason I chose to handle the two separately is that the connection retry can hit some IOExceptions (e.g. failures in verifyConnection() that are not caused by an NM restart) that we shouldn't retry at all. Wrapping that kind of exception inside the retry seems unnecessary. Thoughts?

          Ah, yes good point. We shouldn't retry exceptions that we know aren't related to the NM bouncing, like secret hash verification failures.

          I was under the impression that the copyMapOutput retry could cause a reconnect which itself would have retries. If that's not the case then there's no issue with nested retries.

          Junping Du added a comment -

          Thanks Jason Lowe for the review! I just updated the patch, addressing most of your previous comments and adding a unit test. Please help review it again, thanks!

          I was under the impression that the copyMapOutput retry could cause a reconnect which itself would have retries. If that's not the case then there's no issue with nested retries.

          If copyMapOutput throws an exception during an NM restart, it will first go to reconnect with retry, since in most cases the connect will fail unless we tolerate some time waiting for the NM to recover. We could also give up retrying in connect, but the logic would be more complex, something like the following, which may not be necessary:

          while (...) {
            try {
              failedTasks = copyMapOutput(...);
            } catch (IOException e) {
              try {
                connect(...);
              } catch (IOException e2) {
                // do nothing, go back to the loop and retry copyMapOutput.
              }
            }
          }

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12664422/MAPREDUCE-5891.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4825//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4825//console

          This message is automatically generated.

          Jason Lowe added a comment -

          Thanks for updating the patch, Junping. Comments:

          DEFAULT_SHUFFLE_FETCH_* should be in MRJobConfig

          openConnectionWithRetry should not ignore InterruptedException. Fetchers are shutdown by being interrupted, so I think minimally we should check for stopped==true if one occurs and act accordingly.

          We log a WARN when we can retry but only an INFO when we failed to read a map header and are not retrying. That seems backwards. Also the message logged when we can't retry is a lot more informative than the one when we can.

          We are retrying one more time when we're past the retry timeout which could result in a significantly longer time to discover fetch failures that aren't NM restart-related. This is also inconsistent with how openConnectionWithRetry behaves.

          Only the retry enabled property was added to mapred-default.xml. We should also add the other two properties with their defaults and appropriate descriptions for documentation.

          There should be a unit test to verify fetch errors can still be reported even with retry enabled, as it's important that we don't break the ability to recover from errors not related to NM restart.

          Nit: mapreduce.reduce.shuffle.fetch.interval-ms should be mapreduce.reduce.shuffle.fetch.retry.interval-ms to clearly indicate this is an interval only applicable for fetch retry. Similarly mapreduce.reduce.shuffle.fetch.timeout-ms should be mapreduce.reduce.shuffle.fetch.retry.timeout-ms.

          Nit: "which means it haven't retried yet." should be "which means it hasn't retried yet."
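
          For reference, a sketch of how the Fetcher might read the renamed retry keys; the
          default values shown here are placeholders, not the ones from the patch:

              boolean fetchRetryEnabled = job.getBoolean(
                  "mapreduce.reduce.shuffle.fetch.retry.enabled", true);
              int fetchRetryIntervalMs = job.getInt(
                  "mapreduce.reduce.shuffle.fetch.retry.interval-ms", 1000);
              int fetchRetryTimeoutMs = job.getInt(
                  "mapreduce.reduce.shuffle.fetch.retry.timeout-ms", 30000);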

          Junping Du added a comment -

          Thanks Jason Lowe for the review and comments! In the v2 patch, I addressed all your comments.

          We are retrying one more time when we're past the retry timeout which could result in a significantly longer time to discover fetch failures that aren't NM restart-related. This is also inconsistent with how openConnectionWithRetry behaves.

          Nice catch. Moved the timeout judgment inside copyMapOutput to decide whether to throw an exception for retry (before the timeout) or fail (at or after the timeout).

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12664795/MAPREDUCE-5891-v2.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4829//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4829//console

          This message is automatically generated.

          Jason Lowe added a comment -

          Thanks for updating the patch! Comments:

          SHUFFLE_FETCH_TIMEOUT_MS = "mapreduce.reduce.shuffle.fetch.timeout-ms" but it should be "mapreduce.reduce.shuffle.fetch.retry.timeout-ms"

          openConnectionWithRetry calls abortConnect if stopped, but the one caller of this function does the same thing when it returns. Maybe openConnectionWithRetry should just return if stopped?

          Nit: The code block in copyMapOutput's catch of IOException is getting really long. It would be good to refactor some of this code into methods

          Minor nit: "get failed" should be "failed".

          openConnectionWithRetry is being called and retries even if fetch retry is disabled

          Shouldn't we be setting retryStartTime back to zero instead of endTime below? Otherwise the next error could timeout without any retry if the transfer before the error took longer than the timeout interval.

                // Refresh retryStartTime as map task make progress if retried before.
                if (retryStartTime != 0) {
                  retryStartTime = endTime;
                }
          

          Also wondering if we should reset it after each successful transfer (e.g.: after a successful header parse and successful shuffle)?

          Junping Du added a comment -

          Thanks Jason Lowe for the comments!

          SHUFFLE_FETCH_TIMEOUT_MS should be "mapreduce.reduce.shuffle.fetch.retry.timeout-ms"

          Nice catch, done.

          openConnectionWithRetry calls abortConnect if stopped, but the one caller of this function does the same thing when it returns. Maybe openConnectionWithRetry should just return if stopped?

          Yes. The caller can even return directly, since the caller from the upper layer already handles it. Fixed.

          Nit: The code block in copyMapOutput's catch of IOException is getting really long. It would be good to refactor some of this code into methods. Minor nit: "get failed" should be "failed".

          Done.

          openConnectionWithRetry is being called and retries even if fetch retry is disabled

          Good point, fixed.

          Shouldn't we be setting retryStartTime back to zero instead of endTime below?

          Also good one, fixed it.

          Also wondering if we should reset it after each successful transfer (e.g.: after a successful header parse and successful shuffle)?

          That may not be necessary. If retryStartTime is not 0, this fetcher hasn't successfully made any progress since the last getMapOutput failure, so it should keep retrying and let the wait time accumulate until the timeout.
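
          A sketch of the bookkeeping this describes, with illustrative field names:

              // retryStartTime == 0 means no outstanding failure; it is stamped on the first
              // error and cleared again only once a map output is shuffled successfully.
              if (retryStartTime == 0) {
                retryStartTime = Time.monotonicNow();       // first failure since last progress
              }
              if (Time.monotonicNow() - retryStartTime > fetchRetryTimeoutMs) {
                failMaps(remaining);                        // window exhausted: report the failure
              }
              // otherwise keep retrying; after a successful shuffle: retryStartTime = 0;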

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12665542/MAPREDUCE-5891-v3.patch
          against trunk revision 270a271.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          -1 javac. The patch appears to cause the build to fail.

          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4838//console

          This message is automatically generated.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12665542/MAPREDUCE-5891-v3.patch
          against trunk revision 258c7d0.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4839//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4839//console

          This message is automatically generated.

          Ming Ma added a comment -

          Thanks, Junping, Jason for the useful patch.

          In the case where slowstart is set to some small value, the reducer will fetch some mapper output and wait for the rest. Is it possible that Fetcher.retryStartTime is set to some old value due to an earlier restart of NM host A, and thus the fetcher retry is marked as timed out when it later tries to handle the restart of NM host B?

          To make sure the fetcher doesn't unnecessarily retry in the decommission scenario, it seems the assumption is that we will have some sort of graceful decommission support so that during the decommission process the fetcher will still be able to get mapper output. Is that true?

          If we get time to do YARN-1593, that will further reduce the chance of shuffle handler restart. Any opinion on that?

          Jason Lowe added a comment -

          Thanks for updating the patch, Junping. Comments:

          I'm not sure the following log message adds that much value given the exception is going to be logged just a bit later. A different log message appears when retry is enabled, so I think this just adds to the log length without adding a lot of valuable information.

          +        if (!fetchRetryEnabled) {
          +           LOG.warn("Failed to connect to host: " + url + 
          +                    " when retry is not enabled");
          +           throw e;
          +        }
          

          setupConnectionsWithRetry is now inconsistent when it comes to calling abortConnect() when stopped is true.

          Nit: The following code should simply be retryStartTime = 0;

          +      if (retryStartTime != 0) {
          +        retryStartTime = 0;
          +      }
          

          And to address some of Ming's comments:

          Is it possible Fetcher.retryStartTime is set to some old value due to early NM host A restart, and thus mark fetcher retry timed out when it later tries to handle NM host B restart?

          Yes, that is totally possible. Nice catch, Ming!
          @Junping, the code needs to set retryStartTime to zero at the beginning of copyFromHost as well to make sure remnants from previous failures don't leak into new, fresh attempts.

          To make sure fetcher doesn't unnecessarily retry for the decommission scenario, it seems the assumption is we will have some sort of graceful decommission support so that during decommission process the fetcher will still be able to get mapper output.

          Yes, see YARN-914. If the decommission still takes place even though the application is still running and output still needs to be fetched from that node, the RM will inform the AM of the node being removed from the cluster. The AM will then inform the reducers that the map outputs on that node are obsolete and will reschedule map tasks on other nodes rather than have the reducers keep trying against a decomm'd node.

          If we get time to do YARN-1593, that will further reduce the chance of shuffle handler restart. Any opinion on that?

          Yes that would be quite nice, although that seems like a very significant feature to implement in the 2.6 timeframe. It creates new security, reacquisition, and failure issues, and while it does significantly reduce the scenarios where the ShuffleHandler will need to be restarted it doesn't, by itself, preclude the need to do so. For example, if the ShuffleHandler has a bugfix that needs to be deployed we either need the ability for MapReduce to request different versions of an aux service and have multiple versions running simultaneously or we need to restart the ShuffleHandler service. The latter requires workarounds like this JIRA to avoid potential fetch failure storms when the ShuffleHandler service is down temporarily.

          Junping Du added a comment -

          Thanks for the comments, Ming Ma and Jason Lowe!

          In the case slowstart is set to some small value, the reducer will fetch some mapper output and wait for the rest. Is it possible Fetcher.retryStartTime is set to some old value due to early NM host A restart, and thus mark fetcher retry timed out when it later tries to handle NM host B restart?

          Nice catch! Fixed per Jason's suggestion below.

          so I think this just adds to the log length without adding a lot of valuable information

          Agreed. Removed the log.

          Nit: The following code should simply be retryStartTime = 0;

          Fixed.

          setupConnectionsWithRetry is now inconsistent when it comes to calling abortConnect() when stopped is true.

          Good point. Fixed.

          Also, I agree with Jason Lowe's comments above on YARN-914 and YARN-1593.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12666395/MAPREDUCE-5891-v4.patch
          against trunk revision 3a0142b.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4846//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4846//console

          This message is automatically generated.

          Junping Du added a comment -

          Hi Jason Lowe and Ming Ma, any additional comments on the latest patch?

          Ming Ma added a comment -

          Thanks, Junping. Regarding the default value, mapreduce.reduce.shuffle.fetch.retry.enabled is set to true by default, while NM recovery is set to false by default. That means that by default the fetcher will retry even though the ShuffleHandler won't be able to serve mapper outputs after a restart. It doesn't seem like a big deal; I just want to call out whether that is intentional. Do we foresee other scenarios where fetch retry will be useful? If not, reducers could ask YARN whether NM recovery is enabled, or ask the ShuffleHandler whether recovery is enabled, without defining this retry property.

          Jason Lowe added a comment -

          Thanks for updating the patch, Junping, and sorry for the delay in re-review. The fixes all look fine.

          I agree with Ming that we should be consistent about the default state of this feature and NM restart, although I'm not a fan of adding a YARN API to query NM restart. Task containers currently don't talk with the NM, and IMHO this is not a good enough reason to change that. I'm OK with adding it to the shuffle protocol if we can do it in a backwards-compatible way, although I don't know offhand how that would be accomplished. Another approach is to try to tie the two properties together and have the default value of mapreduce.reduce.shuffle.fetch.retry.enabled in mapred-default.xml be ${yarn.nodemanager.recovery.enabled}, so they could still be set independently but by default the NM restart setting drives the fetch retry setting.
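
          A sketch of that fallback (the property names are the ones discussed here; the constant
          names and defaults in the real patch may differ):

              // If the MR-level retry flag is unset, fall back to the NM recovery flag.
              boolean defaultRetry = conf.getBoolean(
                  "yarn.nodemanager.recovery.enabled", false);
              boolean fetchRetryEnabled = conf.getBoolean(
                  "mapreduce.reduce.shuffle.fetch.retry.enabled", defaultRetry);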

          Ming Ma added a comment -

          The patch looks good. I like Jason's idea to have mapreduce.reduce.shuffle.fetch.retry.enabled use ${yarn.nodemanager.recovery.enabled} as the default value. As for the other approaches,

          a) Dynamic MR-to-YARN query: given that the NM recovery flag is a global cluster-level setting (although it is possible to configure it on a per-NM basis), can we derive the value of mapreduce.reduce.shuffle.fetch.retry.enabled at job submission time from some YARN API call to the RM?

          b) Shuffle protocol change: it seems Fetcher and ShuffleHandler check HTTP headers via property key names. So if we add a new property to indicate whether recovery is supported and keep the same HTTP "version" property, a new version of the Fetcher might be able to work with an old version of the ShuffleHandler, and vice versa.

          Jason Lowe added a comment -

          a) dynamic MR to YARN query, given NM recovery flag is a global cluster level setting ( although it is possible to config it on per NM basis ), can we derive the value of mapreduce.reduce.shuffle.fetch.retry.enabled at job submission time from some YARN API call to RM?

          The RM is unaware of whether the NM supports work-preserving restart, and I'd rather not add that coupling just for this.

          b) shuffle protocol change. It seems Fetcher and ShuffleHandler check http header via property key names. So if we add a new property to indicate if recovery is supported and continue to keep the same http "version" property, new version of fetcher might be able to work with old version of shufflehandler, and vise versa.

          True, we could add a new HTTP header that new Fetchers could query.
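
          A sketch of what that could look like on the Fetcher side; the header name here is
          made up purely for illustration:

              // An old ShuffleHandler would not set the header, so getHeaderField returns null
              // and parseBoolean(null) yields false, keeping the behavior backwards-compatible.
              String recoveryHeader = connection.getHeaderField("ShuffleRecoveryEnabled");
              boolean serverSupportsRecovery = Boolean.parseBoolean(recoveryHeader);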

          Junping Du added a comment -

          Sorry for coming back a little late, I have been busy traveling recently. I will deliver a quick fix for the default value - I prefer to use ${yarn.nodemanager.recovery.enabled} as the default since that seems to be the easiest way for the short term. Thanks Jason Lowe and Ming Ma for the review and the good comments here!

          Junping Du added a comment -

          Fixed the default value as discussed above in the v5 patch.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12669470/MAPREDUCE-5891-v5.patch
          against trunk revision ea4e2e8.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4890//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4890//console

          This message is automatically generated.

          Jason Lowe added a comment -

          Thanks for updating the patch, Junping.

          The property in mapred-default.xml should be updated to use ${yarn.nodemanager.recovery.enabled}, otherwise in practice we'll never fallback to the code default that tries to lookup from the NM property.

          Nit: huffleFetchEnabledDefault should be shuffleFetchEnabledDefault

          Ming Ma do you have additional comments or concerns?

          Junping Du added a comment -

          Thanks Jason Lowe for reviewing the patch! Fixed it per your comments above in the v6 patch.

          Ming Ma added a comment -

          Junping, Jason, the patch looks good to me.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12669804/MAPREDUCE-5891-v6.patch
          against trunk revision 1cf3198.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4895//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4895//console

          This message is automatically generated.

          Jason Lowe added a comment -

          +1 lgtm. Committing this.

          Jason Lowe added a comment -

          Thanks to Junping for the contribution and to Ming for additional review! I committed this to trunk and branch-2.

          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #685 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/685/)
          MAPREDUCE-5891. Improved shuffle error handling across NM restarts. Contributed by Junping Du (jlowe: rev 2c3da25fd718b3a9c1ed67f05b577975ae613f4e)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/ShuffleSchedulerImpl.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
          • hadoop-mapreduce-project/CHANGES.txt
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1901 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1901/)
          MAPREDUCE-5891. Improved shuffle error handling across NM restarts. Contributed by Junping Du (jlowe: rev 2c3da25fd718b3a9c1ed67f05b577975ae613f4e)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          • hadoop-mapreduce-project/CHANGES.txt
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/ShuffleSchedulerImpl.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #1876 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1876/)
          MAPREDUCE-5891. Improved shuffle error handling across NM restarts. Contributed by Junping Du (jlowe: rev 2c3da25fd718b3a9c1ed67f05b577975ae613f4e)

          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/resources/mapred-default.xml
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/MRJobConfig.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/ShuffleSchedulerImpl.java
          • hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Fetcher.java
          • hadoop-mapreduce-project/CHANGES.txt

            People

            • Assignee: Junping Du
            • Reporter: Jason Lowe
            • Votes: 0
            • Watchers: 12

              Dates

              • Created:
                Updated:
                Resolved:
