  1. Apache Tez
  2. TEZ-2863

Container, node, and logs not available in UI for tasks that fail to launch

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7.1, 0.8.3
    • Component/s: None
    • Labels: None

    Description

      While running a sample Tez job:

      tez-examples-*.jar orderedwordcount -Dtez.task.resource.memory.mb=1 -Dtez.task.launch.cmd-opts="-Xmx1m" input output
      

      It was noticed that the Tez UI task attempt entity at http://timelineserverhost:port/ws/v1/timeline/TEZ_TASK_ATTEMPT_ID/attempt_id was missing the TASK_ATTEMPT_STARTED event:

      2015-10-01 10:03:55,344 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1443711816411_0001_1][Event:TASK_STARTED]: vertexName=Tokenizer, taskId=task_1443711816411_0001_1_00_000000, scheduledTime=1443711835342, launchTime=1443711835342
      2015-10-01 10:03:55,346 [INFO] [Dispatcher thread {Central}] |util.RackResolver|: Resolved localhost to /default-rack
      2015-10-01 10:03:55,356 [INFO] [TaskSchedulerEventHandlerThread] |util.RackResolver|: Resolved localhost to /default-rack
      2015-10-01 10:03:55,364 [INFO] [TaskSchedulerEventHandlerThread] |rm.YarnTaskSchedulerService|: Allocation request for task: attempt_1443711816411_0001_1_00_000000_0 with request: Capability[<memory:1, vCores:1>]Priority[2] host: localhost rack: null
      2015-10-01 10:03:56,639 [INFO] [AMRM Heartbeater thread] |impl.AMRMClientImpl|: Received new token for : localhost:57381
      2015-10-01 10:03:56,646 [INFO] [AMRM Callback Handler Thread] |util.RackResolver|: Resolved localhost to /default-rack
      2015-10-01 10:03:56,648 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: Assigning container to task: containerId=container_1443711816411_0001_01_000002, task=attempt_1443711816411_0001_1_00_000000_0, containerHost=localhost:57381, containerPriority= 2, containerResources=<memory:1024, vCores:1>, localityMatchType=NodeLocal, matchedLocation=localhost, honorLocalityFlags=true, reusedContainer=false, delayedContainers=0
      2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: Resolved localhost to /default-rack
      2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: Resolved localhost to /default-rack
      2015-10-01 10:03:56,686 [INFO] [TaskSchedulerAppCaller #0] |node.AMNodeTracker|: Adding new node: localhost:57381
      2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] |launcher.ContainerLauncherImpl|: Launching container_1443711816411_0001_01_000002
      2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] |impl.ContainerManagementProtocolProxy|: Opening proxy : localhost:57381
      2015-10-01 10:03:56,741 [INFO] [ContainerLauncher #0] |history.HistoryEventHandler|: [HISTORY][DAG:N/A][Event:CONTAINER_LAUNCHED]: containerId=container_1443711816411_0001_01_000002, launchTime=1443711836741
      2015-10-01 10:03:57,647 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Allocated container completed:container_1443711816411_0001_01_000002 last allocated to task: attempt_1443711816411_0001_1_00_000000_0
      2015-10-01 10:03:57,648 [INFO] [Dispatcher thread {Central}] |container.AMContainerImpl|: Container container_1443711816411_0001_01_000002 exited with diagnostics set to Container failed, exitCode=1. Exception from container-launch.
      Container id: container_1443711816411_0001_01_000002
      Exit code: 1
      Stack trace: ExitCodeException exitCode=1: 
      	at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
      	at org.apache.hadoop.util.Shell.run(Shell.java:455)
      	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
      	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      
      
      Container exited with a non-zero exit code 1
      
      2015-10-01 10:03:57,649 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1443711816411_0001_1][Event:CONTAINER_STOPPED]: containerId=container_1443711816411_0001_01_000002, stoppedTime=1443711837649, exitStatus=1
      2015-10-01 10:03:57,652 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1443711816411_0001_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Tokenizer, taskAttemptId=attempt_1443711816411_0001_1_00_000000_0, creationTime=1443711835341, allocationTime=0, startTime=0, finishTime=1443711837650, timeTaken=1443711837650, status=FAILED, errorEnum=CONTAINER_EXITED, diagnostics=Container container_1443711816411_0001_01_000002 finished with diagnostics set to [Container failed, exitCode=1. Exception from container-launch.
      Container id: container_1443711816411_0001_01_000002
      Exit code: 1
      Stack trace: ExitCodeException exitCode=1: 
      	at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
      	at org.apache.hadoop.util.Shell.run(Shell.java:455)
      	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
      	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
      	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      
      
      Container exited with a non-zero exit code 1
      ], counters=Counters: 0
      2015-10-01 10:03:57,653 [INFO] [Dispatcher thread {Central}] |counters.Limits|: Counter limits initialized with parameters:  GROUP_NAME_MAX=256, MAX_GROUPS=500, COUNTER_NAME_MAX=64, MAX_COUNTERS=1200
      2015-10-01 10:03:57,657 [INFO] [Dispatcher thread {Central}] |impl.TaskImpl|: Scheduling new attempt for task: task_1443711816411_0001_1_00_000000, currentFailedAttempts: 1, maxFailedAttempts: 4
      2015-10-01 10:03:57,658 [INFO] [TaskSchedulerEventHandlerThread] |rm.YarnTaskSchedulerService|: Ignoring removal of unknown task: attempt_1443711816411_0001_1_00_000000_0
      2015-10-01 10:03:57,658 [INFO] [TaskSchedulerEventHandlerThread] |rm.TaskSchedulerEventHandler|: Task: attempt_1443711816411_0001_1_00_000000_0 has no container assignment in the scheduler
      2015-10-01 10:03:57,658 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Source task attempt completed for vertex: vertex_1443711816411_0001_1_01 [Summation] attempt: attempt_1443711816411_0001_1_00_000000_0 with state: FAILED vertexState: RUNNING
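
The history lines in the AM log above can be scanned to confirm which attempts finished without ever logging a start. The following is a minimal sketch, not part of the original report: the function names are hypothetical, and it assumes that a TASK_ATTEMPT_STARTED history line carries a `taskAttemptId=` field in the same way the TASK_ATTEMPT_FINISHED line does.

```python
import re
from collections import defaultdict

# Matches Tez AM history lines of the form
# "[HISTORY][DAG:<dag id>][Event:<EVENT_NAME>]: key=value, key=value, ..."
HISTORY_RE = re.compile(
    r"\[HISTORY\]\[DAG:[^\]]*\]\[Event:(?P<event>[A-Z_]+)\]:(?P<rest>.*)"
)
# Assumption: attempt-level events identify the attempt as "taskAttemptId=attempt_...".
ATTEMPT_RE = re.compile(r"taskAttemptId=(attempt_\w+)")


def events_by_attempt(log_lines):
    """Map each task attempt id to the history event names logged for it."""
    seen = defaultdict(list)
    for line in log_lines:
        history = HISTORY_RE.search(line)
        if not history:
            continue  # not a history line
        attempt = ATTEMPT_RE.search(history.group("rest"))
        if attempt:
            seen[attempt.group(1)].append(history.group("event"))
    return dict(seen)


def attempts_missing_start(log_lines):
    """Return attempts that logged TASK_ATTEMPT_FINISHED but no TASK_ATTEMPT_STARTED."""
    return sorted(
        attempt
        for attempt, events in events_by_attempt(log_lines).items()
        if "TASK_ATTEMPT_FINISHED" in events and "TASK_ATTEMPT_STARTED" not in events
    )
```

Run against the log in this report, `attempts_missing_start` would be expected to flag attempt_1443711816411_0001_1_00_000000_0, whose only attempt-level history event is TASK_ATTEMPT_FINISHED (with startTime=0 and allocationTime=0).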
      

      However, the TASK_ATTEMPT_STARTED event is the one that contains the inProgressURL. Without it, the UI has no link to the container logs, which makes it very difficult for users to debug their bad JVM args.
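
To check a given attempt for this symptom outside the UI, the timeline entity can be fetched and its event list inspected. This is a rough sketch under stated assumptions: the JSON field names `events` and `eventtype` follow the YARN timeline server v1 entity layout as I understand it, the helper names are hypothetical, and `sample_entity` is made-up illustrative data rather than real server output.

```python
import json
from urllib.request import urlopen


def fetch_attempt_entity(base_url, attempt_id):
    """Fetch one TEZ_TASK_ATTEMPT_ID entity, e.g. base_url='http://timelineserverhost:port'."""
    url = f"{base_url}/ws/v1/timeline/TEZ_TASK_ATTEMPT_ID/{attempt_id}"
    with urlopen(url) as response:
        return json.load(response)


def event_types(entity):
    """List the event type names recorded on a timeline entity (assumed field names)."""
    return [event.get("eventtype") for event in entity.get("events", [])]


def has_started_event(entity):
    """True if the entity recorded a TASK_ATTEMPT_STARTED event."""
    return "TASK_ATTEMPT_STARTED" in event_types(entity)


# Hypothetical entity shaped like the failure described in this report:
# the attempt finished but never recorded a start.
sample_entity = {
    "entitytype": "TEZ_TASK_ATTEMPT_ID",
    "entity": "attempt_1443711816411_0001_1_00_000000_0",
    "events": [
        {"eventtype": "TASK_ATTEMPT_FINISHED", "timestamp": 1443711837650, "eventinfo": {}},
    ],
}
```

Keeping the fetch separate from the check means the check can be exercised on saved JSON without a running timeline server.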

      The stdout for the failed container has the launch failure reason in this scenario:

       
      Error occurred during initialization of VM
      Too small initial heap for new size specified
      

      Attachments

        1. TEZ-2863.1.patch
          27 kB
          Jonathan Turner Eagles
        2. TEZ-2863.2.patch
          24 kB
          Jonathan Turner Eagles
        3. TEZ-2863.2-branch-0.7.patch
          31 kB
          Jonathan Turner Eagles
        4. TEZ-2863.3.patch
          31 kB
          Jonathan Turner Eagles
        5. TEZ-2863.3.patch.addendum
          3 kB
          Jeff Zhang
        6. TEZ-2863.3-branch-0.7.patch
          37 kB
          Jonathan Turner Eagles
        7. TEZ-2863.3-branch-0.7.patch.addendum
          0.8 kB
          Jeff Zhang
        8. TEZ-2863.4.patch
          31 kB
          Jonathan Turner Eagles
        9. TEZ-2863.4-branch-0.7.patch
          37 kB
          Jonathan Turner Eagles
        10. TEZ-2863.5.patch
          34 kB
          Jonathan Turner Eagles
        11. TEZ-2863.5-branch-0.7.patch
          40 kB
          Jonathan Turner Eagles

          People

            Assignee: Jonathan Turner Eagles (jeagles)
            Reporter: Jonathan Turner Eagles (jeagles)
            Votes: 0
            Watchers: 3
