Apache Tez / TEZ-591

Provide more specific diagnostic information to the Tez client

    Details

    • Type: Wish
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels:
      None

      Description

      While developing Pig on Tez, I found it hard to debug DAG failures due to the lack of diagnostic information. Currently, the MR Pig client reports the backend error message when a job fails. For example, if I have a UDF that throws a runtime exception, I see the following stack trace in the front-end log file:

      Pig Stack Trace
      ---------------
      ERROR 1066: Unable to open iterator for alias b. Backend error : FAIL IT NOW!
      ...
      Caused by: java.lang.RuntimeException: FAIL IT NOW!
          at Kill.exec(Kill.java:9)
          at Kill.exec(Kill.java:6)
          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:334)
          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:383)
          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:346)
          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
          at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
          at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
          at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
          at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
          at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:396)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
      

      Basically, I'd like to do something similar in Tez Pig.

      If there are multiple failed vertices and tasks, it may not be possible to propagate all the backend exceptions to the frontend. But would it be possible to propagate at least the first few? Perhaps one per failed vertex? Given that DAGStatus.getDiagnostics() returns a list of Strings, it seems feasible.
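
A minimal sketch of how a client could condense such a list of diagnostic strings into a few front-end messages. It is not part of the patch: the class `DiagnosticsSummary`, the method `firstLines`, and the sample strings are hypothetical, and the input is assumed to be a plain `List<String>` like the one `DiagnosticsStatus.getDiagnostics()` is described as returning above (here, the list from `DAGStatus.getDiagnostics()`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not Tez code: keeps only the headline of each diagnostic.
public class DiagnosticsSummary {

    /**
     * Returns the first line of at most maxEntries non-empty diagnostics,
     * dropping any stack frames that follow the headline.
     */
    public static List<String> firstLines(List<String> diagnostics, int maxEntries) {
        List<String> summary = new ArrayList<>();
        if (diagnostics == null) {
            return summary;
        }
        for (String diagnostic : diagnostics) {
            if (summary.size() >= maxEntries) {
                break; // cap what is propagated to the front end
            }
            if (diagnostic == null || diagnostic.trim().isEmpty()) {
                continue; // skip blank entries
            }
            // Keep only the text before the first line break.
            summary.add(diagnostic.trim().split("\\r?\\n", 2)[0]);
        }
        return summary;
    }

    public static void main(String[] args) {
        // Sample strings are made up for illustration.
        List<String> diagnostics = Arrays.asList(
            "Vertex failed, vertexName=map\n\tat Kill.exec(Kill.java:9)",
            "Vertex killed, vertexName=reduce");
        for (String line : firstLines(diagnostics, 5)) {
            System.out.println(line);
        }
    }
}
```

Capping the number of entries keeps the front-end message short even when many vertices fail, at the cost of hiding later failures; the full list would still be available from the status object.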

      1. TEZ-591.1.patch
        12 kB
        Hitesh Shah

          Activity

          Hitesh Shah added a comment -

          Bikas Saha: Most of the logs use the vertex id, hence the vertex id is retained in the diagnostics. I can remove it if you believe it should not be exposed even as a string in a log message. I will fix the attempts.size() sizing in a follow-up patch.

          Bikas Saha added a comment -

          If diagnostics are mainly user-visible, then how about dropping the internal vertex id from the messages?

          +    addDiagnostic("Vertex re-running"
          +      + ", vertexName=" + vertex.getName()
          +      + ", vertexId=" + vertex.getVertexId());
          

          Can the hard-coded 5 be replaced by attempts.size()?

          +    List<String> diagnostics = new ArrayList<String>(5);
          +    readLock.lock();
          +    try {
          +      for (TaskAttempt att : attempts.values()) {
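
The sizing suggestion can be sketched in isolation. This is an illustration under stated assumptions, not the patch itself: the class `AttemptDiagnostics`, its map of attempt ids to messages, and the output format are hypothetical stand-ins for the task's real attempt bookkeeping.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a task that tracks diagnostics per attempt.
public class AttemptDiagnostics {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // attempt id -> diagnostic messages recorded for that attempt
    private final Map<String, List<String>> attempts = new LinkedHashMap<>();

    public void addAttempt(String attemptId, List<String> diagnostics) {
        lock.writeLock().lock();
        try {
            attempts.put(attemptId, new ArrayList<>(diagnostics));
        } finally {
            lock.writeLock().unlock();
        }
    }

    public List<String> getDiagnostics() {
        lock.readLock().lock();
        try {
            // Size the list from the number of attempts instead of a fixed 5.
            // This is only an initial capacity; the list still grows if an
            // attempt carries more than one message.
            List<String> diagnostics = new ArrayList<>(attempts.size());
            for (Map.Entry<String, List<String>> entry : attempts.entrySet()) {
                for (String message : entry.getValue()) {
                    diagnostics.add("AttemptID:" + entry.getKey() + " Info:" + message);
                }
            }
            return diagnostics;
        } finally {
            lock.readLock().unlock(); // always release, even on exceptions
        }
    }
}
```

Reading `attempts.size()` inside the read lock keeps the capacity hint consistent with the map that is iterated immediately afterwards.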
          
          Hitesh Shah added a comment -

          Diagnostics for a simple test job:

          DAG Status: status=FAILED, progress=TotalTasks: 2 Succeeded: 0 Running: 0 Failed: 1 Killed: 1, diagnostics=Vertex failed, vertexName=map, vertexId=vertex_1383611768628_0001_1_00, diagnostics=[Task failed, taskId=task_1383611768628_0001_1_00_000000, diagnostics=[AttemptID:attempt_1383611768628_0001_1_00_000000_0 Info:Error: java.io.IOException: Throwing a simulated error from map
          	at org.apache.tez.mapreduce.examples.MRRSleepJob$SleepMapper.map(MRRSleepJob.java:292)
          	at org.apache.tez.mapreduce.examples.MRRSleepJob$SleepMapper.map(MRRSleepJob.java:211)
          	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
          	at org.apache.tez.mapreduce.processor.map.MapProcessor.runNewMapper(MapProcessor.java:201)
          	at org.apache.tez.mapreduce.processor.map.MapProcessor.run(MapProcessor.java:125)
          	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:201)
          	at org.apache.hadoop.mapred.YarnTezDagChild$4.run(YarnTezDagChild.java:452)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:394)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1515)
          	at org.apache.hadoop.mapred.YarnTezDagChild.main(YarnTezDagChild.java:442)
          
          Container released by application, AttemptID:attempt_1383611768628_0001_1_00_000000_1 Info:Error: java.io.IOException: Throwing a simulated error from map
          	at org.apache.tez.mapreduce.examples.MRRSleepJob$SleepMapper.map(MRRSleepJob.java:292)
          	at org.apache.tez.mapreduce.examples.MRRSleepJob$SleepMapper.map(MRRSleepJob.java:211)
          	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
          	at org.apache.tez.mapreduce.processor.map.MapProcessor.runNewMapper(MapProcessor.java:201)
          	at org.apache.tez.mapreduce.processor.map.MapProcessor.run(MapProcessor.java:125)
          	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:201)
          	at org.apache.hadoop.mapred.YarnTezDagChild$4.run(YarnTezDagChild.java:452)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:394)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1515)
          	at org.apache.hadoop.mapred.YarnTezDagChild.main(YarnTezDagChild.java:442)
          
          Container released by application, AttemptID:attempt_1383611768628_0001_1_00_000000_2 Info:Error: java.io.IOException: Throwing a simulated error from map
          	at org.apache.tez.mapreduce.examples.MRRSleepJob$SleepMapper.map(MRRSleepJob.java:292)
          	at org.apache.tez.mapreduce.examples.MRRSleepJob$SleepMapper.map(MRRSleepJob.java:211)
          	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
          	at org.apache.tez.mapreduce.processor.map.MapProcessor.runNewMapper(MapProcessor.java:201)
          	at org.apache.tez.mapreduce.processor.map.MapProcessor.run(MapProcessor.java:125)
          	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:201)
          	at org.apache.hadoop.mapred.YarnTezDagChild$4.run(YarnTezDagChild.java:452)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:394)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1515)
          	at org.apache.hadoop.mapred.YarnTezDagChild.main(YarnTezDagChild.java:442)
          
          Container released by application, AttemptID:attempt_1383611768628_0001_1_00_000000_3 Info:Error: java.io.IOException: Throwing a simulated error from map
          	at org.apache.tez.mapreduce.examples.MRRSleepJob$SleepMapper.map(MRRSleepJob.java:292)
          	at org.apache.tez.mapreduce.examples.MRRSleepJob$SleepMapper.map(MRRSleepJob.java:211)
          	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
          	at org.apache.tez.mapreduce.processor.map.MapProcessor.runNewMapper(MapProcessor.java:201)
          	at org.apache.tez.mapreduce.processor.map.MapProcessor.run(MapProcessor.java:125)
          	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:201)
          	at org.apache.hadoop.mapred.YarnTezDagChild$4.run(YarnTezDagChild.java:452)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:394)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1515)
          	at org.apache.hadoop.mapred.YarnTezDagChild.main(YarnTezDagChild.java:442)
          ], Vertex killed as one or more tasks failed. failedTasks:1]
          Vertex killed, vertexName=reduce, vertexId=vertex_1383611768628_0001_1_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex killed as other vertex failed. failedTasks:0]
          DAG failed due to vertex failure. failedVertices:1 killedVertices:1
          
          Hitesh Shah added a comment -

          Committed to master.


            People

            • Assignee:
              Hitesh Shah
              Reporter:
              Cheolsoo Park
            • Votes:
              0
              Watchers:
              3
