Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.1.0-beta
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Target Version/s:

      Description

      The MapReduce AM considers a forcibly preempted container as FAILED, while I think it should be considered KILLED (i.e., it should not count against the maximum number of task attempt failures).
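      For context, the change under discussion amounts to treating a container that completed with the PREEMPTED exit status as killed rather than failed. Below is a minimal, illustrative Java sketch of that classification, assuming the public YARN ContainerStatus/ContainerExitStatus records and the MR AM's TaskAttemptEvent types; the class name, helper method, and event-handler wiring are hypothetical and not the actual patch:

          import org.apache.hadoop.mapreduce.v2.api.records.TaskAttemptId;
          import org.apache.hadoop.mapreduce.v2.app.job.event.TaskAttemptEvent;
          import org.apache.hadoop.mapreduce.v2.app.job.event.TaskAttemptEventType;
          import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
          import org.apache.hadoop.yarn.api.records.ContainerStatus;
          import org.apache.hadoop.yarn.event.EventHandler;

          // Illustrative sketch only: route a preempted container to a kill event
          // instead of a failure, so it does not count against
          // mapreduce.map.maxattempts / mapreduce.reduce.maxattempts.
          final class PreemptionClassifierSketch {
            private final EventHandler eventHandler;

            PreemptionClassifierSketch(EventHandler eventHandler) {
              this.eventHandler = eventHandler;
            }

            // Called for each completed container reported back by the RM.
            void onCompletedContainer(TaskAttemptId attemptId, ContainerStatus status) {
              if (status.getExitStatus() == ContainerExitStatus.PREEMPTED) {
                // Forcibly preempted by the RM: kill the attempt rather than fail it.
                eventHandler.handle(new TaskAttemptEvent(attemptId, TaskAttemptEventType.TA_KILL));
                return;
              }
              // Otherwise fall through to the existing completed/failed handling.
            }
          }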

      1. MR-5848.patch
        17 kB
        Subru Krishnan
      2. MR-5848.patch
        2 kB
        Subru Krishnan
      3. YARN-1958.patch
        1 kB
        Carlo Curino

        Issue Links

          Activity

          Subru Krishnan added a comment -

          Marking it as fixed, as it is subsumed by the later MAPREDUCE-5900.

          Ashwin Shankar added a comment -

          Committers,
          can we please commit this if it's looking good?
          I tried out this patch and it seems to work well.
          Without this, jobs fail due to preemption after the mapreduce.(map,reduce).maxattempts limit is reached.
          This is also blocking YARN-1961.

          Carlo Curino added a comment -

          Mayank, I think you are right that the behavior is equivalent at the moment. The advantage of tracking TA_PREEMPTED explicitly is that in the future we will be able to have different behaviors for preempted and killed tasks,
          without having to parse the diagnostic message. This seems to ease future work on this, so I would lean towards keeping the patch with this enrichment of the state machine unless something else is problematic with it.

          Mayank Bansal added a comment -

          I think we don't need TA_PREEMPTED; TA_KILL is enough for this purpose.
          The diagnostics message comes from the RM in case of preemption, and that's populated for TA_KILL as well.
          I just tested without adding TA_PREEMPTED and it works as expected.

          Thoughts?

          Thanks,
          Mayank

          Carlo Curino added a comment -

          We have run this on a test cluster for a few thousand jobs with heavy preemption going on, and it seems fine.

          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12643673/MR-5848.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4588//testReport/
          Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4588//console

          This message is automatically generated.

          Subru Krishnan added a comment -

          Updated patch after testing and with more unit tests

          Carlo Curino added a comment -

          Detected some issues while stress testing... investigating now.

          Subru Krishnan added a comment -

          Jason Lowe Thanks for the feedback. Uploading a new patch addressing your comments.

          Jason Lowe added a comment - edited

          Sure, I think that's a reasonable approach. It's definitely an incremental improvement from what we have today.

          The kill event should have a useful diagnostic message attached to it so users can tell that it was preempted, i.e.: instead of a raw TA_KILL event it should send a TaskAttemptKillEvent with a diagnostic. Otherwise I think users will be confused as to why the task was killed (e.g.: was it preempted vs. aborted?) and may have to dig through a rather large AM log to sort that out.

          It'd also be nice to add a unit test in TestRMContainerAllocator.
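
          A minimal sketch of that suggestion, assuming TaskAttemptKillEvent accepts a diagnostic message; the wrapper class and the message text here are illustrative, not the committed change:

              import org.apache.hadoop.mapreduce.v2.api.records.TaskAttemptId;
              import org.apache.hadoop.mapreduce.v2.app.job.event.TaskAttemptKillEvent;

              // Illustrative sketch: send a kill event carrying a diagnostic instead of
              // a raw TA_KILL, so users can see why the attempt was killed without
              // digging through the AM log.
              final class PreemptionKillEventSketch {
                static TaskAttemptKillEvent killedByPreemption(TaskAttemptId attemptId) {
                  return new TaskAttemptKillEvent(attemptId,
                      "Container preempted by the ResourceManager");
                }
              }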

          Subru Krishnan added a comment -

          Jason Lowe, thanks for pointing out that the patch is necessary but not sufficient for the AM to make the decision about tasks being KILLED rather than FAILED.

          We have been running under heavy load with preemption enabled using GridMix, and the patch has significantly reduced the number of tasks reported incorrectly as FAILED; in fact, we haven't seen the SIGTERM issue at all. So we were wondering if we can have a separate JIRA to track handling of the SIGTERM scenario while this patch gets pushed, as this is definitely required?

          Jason Lowe added a comment - edited

          On the positive side, the AM should know the container was on the short-list to be killed from previous preemption messages it received, so maybe it could count a failure of a container "doomed" by preemption as a kill? Or simply postpone the decision on FAIL/KILL. Not sure...

          Yes, the AM should definitely know, and I think the change in the patch is good, just not sufficient.

          As for postponing the decision, we may have to do just that. To resolve the general case of SIGTERM potentially causing failures in the task which should be ignored in light of the kill, the AM may need to wait until it receives the container status from the RM to distinguish the cases. Haven't thought through all of the ramifications of doing that, and I suspect there could be some long delays for some corner cases (e.g.: node fails as task fails, takes the RM a while to expire the node in order to send the container status).

          Jason Lowe added a comment -

          Yes, FileSystem is doing this. See the ClientFinalizer in FileSystem. One example where we've seen this race occur is when tasks start running outside their memory boundaries and the NM kills them. Sometimes the task kills cleanly and the AM gets the useful container status added by the NM explaining why the container was killed. Other times the task ends up failing first due to actions caused by the SIGTERM processing and the useful container status arrives too late.
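
          To make the mechanism concrete, here is a rough sketch of the kind of hook being described (not the actual ClientFinalizer code): a JVM shutdown hook that closes every cached FileSystem, which is roughly what runs after the NM's SIGTERM while task threads may still be doing I/O:

              import java.io.IOException;
              import org.apache.hadoop.fs.FileSystem;

              // Rough sketch of a ClientFinalizer-like shutdown hook. Once it runs,
              // task threads still using cached FileSystem instances can see
              // "Filesystem closed" IOExceptions and report a failure over the task
              // umbilical before the RM's preempted-container status reaches the AM.
              public class ClientFinalizerSketch {
                public static void install() {
                  Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
                    @Override
                    public void run() {
                      try {
                        FileSystem.closeAll();   // close all cached filesystems
                      } catch (IOException e) {
                        // best effort during shutdown
                      }
                    }
                  }));
                }
              }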

          Vinod Kumar Vavilapalli added a comment -

          We should also expose a task-attempt-diagnostic which says the container got preempted, that will help a bit.

          When a task attempt receives SIGTERM from the NM it causes the FileSystem to close via the shutdown hook and often causes exceptions within the task.

          I haven't seen this in YarnChild. Is FileSystem itself installing this?

          Carlo Curino added a comment -

          Jason, thanks for moving this to MR; I mislabeled it.

          Regarding your comment, I was suspecting that something like that was going on (hence my doubting the patch was enough).

          On the positive side, the AM should know the container was on the short-list to be killed from previous preemption messages it received,
          so maybe it could count a failure of a container "doomed" by preemption as a kill? Or simply postpone the decision on FAIL/KILL. Not sure...
          I don't have time to look at this now, but if you provide a fix I am happy to help you validate it (for different reasons we are testing preemption
          under rather extreme scenarios).

          Jason Lowe added a comment -

          Moving this to MAPREDUCE since it's an issue with the MapReduce AM and not YARN.

          Note that the small patch may not be sufficient to completely solve the issue of preempted tasks being interpreted as FAILED rather than KILLED. Due to races between the AM heartbeat to the RM and the AM receiving task status updates via the task umbilical, the AM may still think that a task attempt failed even though it was preempted. When a task attempt receives SIGTERM from the NM, it causes the FileSystem to close via the shutdown hook and often causes exceptions within the task. Those exceptions are reported as task failure via the task umbilical. If that status arrives at the AM before the AM receives the preempted container status from the RM, then the AM counts it as a task failure rather than a preemption.

          Carlo Curino added a comment -

          I am attaching a very small fix for the AM. I am not 100% sure this is all it takes, but this seems needed at the very least.

          Carlo Curino added a comment -

          One-line patch fixing how the AM interprets a forcibly preempted container.


            People

            • Assignee:
              Subru Krishnan
              Reporter:
              Carlo Curino
            • Votes:
              0
              Watchers:
              11
