Hadoop Map/Reduce
  MAPREDUCE-3688

Need better Error message if AM is killed/throws exception

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.23.1
    • Fix Version/s: 0.23.2
    • Component/s: mr-am, mrv2
    • Labels:
      None

      Description

      We need better error messages in the UI if the AM gets killed or throws an Exception.

      If the following error gets thrown:
      java.lang.NumberFormatException: For input string: "9223372036854775807l" // last char is an L

      then the UI should show this exception. Instead, I get the following:

      Application application_1326504761991_0018 failed 1 times due to AM Container for appattempt_1326504761991_0018_000001
      exited with exitCode: 1 due to: Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException
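
      For context, the parse failure quoted above can be reproduced in isolation. The snippet below is purely illustrative: it assumes the offending value was handed to Long.parseLong (e.g. from a configuration property), which rejects the trailing "l" suffix that Java only accepts in source-code literals.

          // Illustrative only: reproduces the NumberFormatException from the description.
          public class TrailingSuffixParse {
              public static void main(String[] args) {
                  // Throws java.lang.NumberFormatException: For input string: "9223372036854775807l"
                  Long.parseLong("9223372036854775807l");
              }
          }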

      Attachments

      1. mapreduce-3688-h0.23-v02.patch (3 kB, Koji Noguchi)
      2. mapreduce-3688-h0.23-v01.patch (2 kB, Koji Noguchi)

        Issue Links

          Activity

          Steve Loughran added a comment -

          Having a quick look at this patch: if it's trying to print the exception details to stdout, then the e.toString() value should be printed too. Throwable.printStackTrace() does the trace but not the exception body.
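
          As a concrete illustration of the pattern being suggested here (a sketch only, not the patch itself), assuming a plain Java entry point:

              // Sketch: print both the exception body (toString) and the stack trace to stdout.
              public class PrintExceptionBody {
                  public static void main(String[] args) {
                      try {
                          Long.parseLong("9223372036854775807l"); // provoke a failure
                      } catch (Exception e) {
                          System.out.println(e.toString()); // exception class and message
                          e.printStackTrace(System.out);    // stack trace, also sent to stdout
                      }
                  }
              }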

          Vinod Kumar Vavilapalli added a comment -

          Ravi, I reopened MAPREDUCE-3949 and assigned it to you; can you please add info and perhaps work on it?

          Sandy Ryza added a comment -

          Filed YARN-499 for adding logs to diagnostics

          Sandy Ryza added a comment -

          Cool, thanks Ravi

          Ravi Prakash added a comment -

          "Does me working on picking up the logs and you working on the over-resource-limits message work for you?"

          Sure! I'll file a separate task / JIRA for it, and please feel free to continue working on this one.

          Sandy Ryza added a comment -

          Hi Ravi,

          I haven't come across the issue that you mentioned, i.e. I've gotten the proper diagnostic message when the NM kills a container for going over resource limits, but my testing has been limited. It sounds like some sort of bug in the NM state machine?

          The part I've been looking into is related to Koji's work: making it so that any errors that containers spit out to stdout/stderr on startup get added to the diagnostics.

          As the focus of this JIRA has moved between a few related but separate issues, my opinion is that at this point it makes the most sense to file new JIRAs (or subtasks?) for the specific changes we want to make. Does me working on picking up the logs and you working on the over-resource-limits message work for you?

          Ravi Prakash added a comment -

          Hi Sandy! Did you think about this? Otherwise I'll be happy to take this back and work on it this week.

          From my testing on trunk, I notice that even for the case where the AM goes over container limits (which I trigger with -Dyarn.app.mapreduce.am.resource.mb=512 -Dyarn.app.mapreduce.am.command-opts="-Xmx3500m" on a sleep job), sometimes the error is propagated back and sometimes it's not. Can you please corroborate this? When State == FinalState == FAILED, the error is propagated back. However, about half the time, State == FINISHED and FinalState == KILLED, in which case there is no message anywhere to help me. Not in the diagnostics, and there are no logs.

          Eli Reisman added a comment -

          Love it! Go Koji!!!

          Sandy was right to identify all the potential pain points to check into. In the end, we need to figure out sensible, clear exit codes and error messages to propagate. The empty diagnostic messages just make everything slower and more difficult. Often this will be people's first impression of working with YARN (getting their job set up and running to see what happens!), so we don't want that.

          Koji Noguchi added a comment -

          Another common error is the ApplicationMaster going out of memory when the number of tasks is large. Adding the error message to stdout so that the OOM would show.

          Diagnostics: Application application_1362579399138_0003 failed 1 times due to AM Container for appattempt_1362579399138_0003_000001 exited with exitCode: 255 due to: Error starting MRAppMaster: java.lang.OutOfMemoryError: Java heap space at

          Forgot to mention, but having these messages in the UI also means they would show up on the job client (console) side as well.

          Koji Noguchi added a comment -

          This has been a pain for our users as well.

          I don't think this patch will fly well with the reviewers, but maybe it'll help move the discussion forward.

          I didn't see a good way of communicating the error message to the caller, so I decided to sacrifice the stdout stream that the current MRAppMaster does not use.

          After the patch, the web UI would show:

          Diagnostics: Application application_1362527487477_0005 failed 1 times due to AM Container for appattempt_1362527487477_0005_000001 exited with exitCode: 1 due to: Error starting MRAppMaster: org.apache.hadoop.yarn.YarnException: java.io.IOException: Split metadata size exceeded 20. Aborting job job_1362527487477_0005
              at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1290)
              at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1146)
              at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1118)
              at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:382)
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:299)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
              at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:823)
              at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:121)
              at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1094)
              at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.start(MRAppMaster.java:998)
              at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1273)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:396)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1221)
              at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1269)
              at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1226)
          Caused by: java.io.IOException: Split metadata size exceeded 20. Aborting job job_1362527487477_0005
              at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:53)
              at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1285)
              ... 16 more
          .Failing this attempt.. Failing the application.

          (This patch is based on 0.23)
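
          The sketch below shows the general shape of this approach: catching startup failures and echoing them to stdout so they can be surfaced in the attempt diagnostics, as in the example above. The class and the startUp() helper are hypothetical stand-ins, not the actual patch.

              public class AppMasterLaunchSketch {
                  // Hypothetical stand-in for the real AM initialization path.
                  private static void startUp(String[] args) {
                      throw new IllegalStateException("Split metadata size exceeded 20");
                  }

                  public static void main(String[] args) {
                      try {
                          startUp(args);
                      } catch (Throwable t) {
                          // Echo the failure to stdout so it can be picked up and shown
                          // in the diagnostics, per the approach described above.
                          System.out.println("Error starting MRAppMaster: " + t);
                          t.printStackTrace(System.out);
                          System.exit(1); // non-zero exit so the attempt is marked failed
                      }
                  }
              }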

          Ravi Prakash added a comment -

          "Ravi Prakash, mind if I pick this up?"

          Not at all. Sorry for sitting on this soooo long. I thought maybe I can make it hatch. =D

          Jason Lowe added a comment -

          The original intent of this JIRA is a MapReduce thing. The AM encounters a configuration error that causes it to abort before even trying to register with the RM. If the AM fails to register with the RM, there's no hope of the RM's web UI giving any useful diagnostic since the only thing that knows the diagnostic is the AM. The only thing it can do is point to AM logs (which is already fixed by things like YARN-236) because the container just up and exited.

          It'd be nice to handle all the YARN-side failures like localization, launching, etc., but that should be handled in a separate YARN JIRA. Most things that happen once the MRAppMaster gets going could be handled better by the AM itself. A perfect example is the one in the description, or things like MAPREDUCE-4937, where the AM could do a much better job of informing the user of the error by registering with the RM, then unregistering with a failure status and a useful diagnostic, and also generating job history so it ends up in the "usual places" for users.
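
          A rough sketch of registering and then unregistering with a failure diagnostic is below. It uses the AMRMClient API from later Hadoop 2.x releases (the 0.23-era client API differs), so it is illustrative rather than the eventual fix; the empty host, port, and tracking-URL placeholders are assumptions.

              import org.apache.hadoop.conf.Configuration;
              import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
              import org.apache.hadoop.yarn.client.api.AMRMClient;
              import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
              import org.apache.hadoop.yarn.conf.YarnConfiguration;

              public class FailFastAmSketch {
                  public static void main(String[] args) throws Exception {
                      Configuration conf = new YarnConfiguration();
                      AMRMClient<ContainerRequest> rm = AMRMClient.createAMRMClient();
                      rm.init(conf);
                      rm.start();
                      try {
                          // Register first, so the RM has an attempt to attach diagnostics to.
                          rm.registerApplicationMaster("", 0, "");
                          // ... AM initialization that may fail goes here ...
                      } catch (Exception e) {
                          // Unregister as FAILED with a human-readable diagnostic so the
                          // RM web UI and the job client can show the real cause.
                          rm.unregisterApplicationMaster(
                              FinalApplicationStatus.FAILED, "Error starting AM: " + e, "");
                          throw e;
                      } finally {
                          rm.stop();
                      }
                  }
              }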

          Sandy Ryza added a comment -

          I didn't have a specific plan yet, so any guidance on the best course would be helpful. I was planning to only handle things on the YARN side in this JIRA.

          It seems like the question that needs to be answered is what information gets percolated up to the user in the following (not necessarily mutually exclusive) situations:

          • AM is killed by NM for going over resource limits
          • AM OOMEs
          • AM dies before registering with the RM
          • AM dies after registering with the RM
          • AM is killed by RM (preempted or because of logic error)
          • AM localization fails
          • AM launch fails

          It looks like the issue raised in MAPREDUCE-3949 about a container being killed when it goes over resource limits is now fixed - a sensible diagnostic shows up both on the command line and in the UI. I haven't yet tested what happens when the AM dies unexpectedly after being registered, but if log aggregation is on, its logs should be accessible through the RM UI. When the AM dies before being registered, nothing useful gets reported. I'm not sure what the best thing to do in this case is - at the very least it would be good to report that it died before registering.

          Jason Lowe added a comment -

          MAPREDUCE-4937 looks like a specific instance of this issue. Are you planning on having the AM register/unregister with the RM for diagnostics and exit with proper job history as part of the fix?

          Sandy Ryza added a comment -

          Ravi Prakash, mind if I pick this up?

          Eli Reisman added a comment -

          This has been making me crazy. Please fix this. Every time I install hadoop-2.0.x I spend more time than I'd care to admit fiddling to get this to stop happening:

          INFO mapreduce.Job: Job job_1359696651870_0001 failed with state FAILED due to: Application application_1359696651870_0001 failed 1 times due to AM Container for appattempt_1359696651870_0001_000001 exited with exitCode: 1 due to:
          .Failing this attempt.. Failing the application.

          This is not really much to go on.

          Vinod Kumar Vavilapalli added a comment -

          This has been a pretty annoying issue that people are running into repeatedly; we need to fix it. Targeting 0.23.2.


            People

            • Assignee: Sandy Ryza
            • Reporter: David Capwell
            • Votes: 0
            • Watchers: 11
