Uploaded image for project: 'Apache Tez'
  1. Apache Tez
  2. TEZ-4506

Report the node of a task attempt failure better

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 0.10.3
    • None
    • None

    Description

      There was an OOM in a hive application, and in the console, I can see things like this:

      Error: Error while compiling statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Reducer 3, vertexId=vertex_1686687930454_0003_1_10, diagnostics=[Task failed, taskId=task_1686687930454_0003_1_10_000201, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : java.lang.RuntimeException: Reducer 3 operator initialization failed
      	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:265)
      	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:268)
      	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:252)
      	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
      	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75)
      	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
      	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62)
      	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38)
      	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
      	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
      	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
      	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Async Initialization failed. abortRequested=false
      	at org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:464)
      	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:398)
      	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571)
      	at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523)
      	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384)
      	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571)
      	at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523)
      	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384)
      	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:237)
      	... 17 more
      Caused by: java.lang.OutOfMemoryError: Java heap space
      	at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.expandAndRehash(VectorMapJoinFastLongHashTable.java:166)
      	at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.add(VectorMapJoinFastLongHashTable.java:100)
      	at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.adaptPutRow(VectorMapJoinFastLongHashTable.java:91)
      	at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashMap.putRow(VectorMapJoinFastLongHashMap.java:147)
      	at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastTableContainer.putRow(VectorMapJoinFastTableContainer.java:184)
      	at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastHashTableLoader.load(VectorMapJoinFastHashTableLoader.java:130)
      	at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTableInternal(MapJoinOperator.java:385)
      	at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:454)
      	at org.apache.hadoop.hive.ql.exec.MapJoinOperator.lambda$initializeOp$0(MapJoinOperator.java:238)
      	at org.apache.hadoop.hive.ql.exec.MapJoinOperator$$Lambda$38/1790423335.call(Unknown Source)
      	at org.apache.hadoop.hive.ql.exec.tez.ObjectCache.retrieve(ObjectCache.java:96)
      	at org.apache.hadoop.hive.ql.exec.tez.ObjectCache$1.call(ObjectCache.java:113)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	... 3 more
      , errorMessage=Cannot recover from this error:java.lang.RuntimeException: Reducer 3 operator initialization failed
      	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:265)
      	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:268)
      	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:252)
      	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
      	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:75)
      	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:62)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:422)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
      	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:62)
      	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:38)
      	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
      	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
      	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
      	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Async Initialization failed. abortRequested=false
      	at org.apache.hadoop.hive.ql.exec.Operator.completeInitialization(Operator.java:464)
      	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:398)
      	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571)
      	at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523)
      	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384)
      	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:571)
      	at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:523)
      	at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:384)
      	at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:237)
      	... 17 more
      Caused by: java.lang.OutOfMemoryError: Java heap space
      	at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.expandAndRehash(VectorMapJoinFastLongHashTable.java:166)
      	at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.add(VectorMapJoinFastLongHashTable.java:100)
      	at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashTable.adaptPutRow(VectorMapJoinFastLongHashTable.java:91)
      	at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastLongHashMap.putRow(VectorMapJoinFastLongHashMap.java:147)
      	at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastTableContainer.putRow(VectorMapJoinFastTableContainer.java:184)
      	at org.apache.hadoop.hive.ql.exec.vector.mapjoin.fast.VectorMapJoinFastHashTableLoader.load(VectorMapJoinFastHashTableLoader.java:130)
      	at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTableInternal(MapJoinOperator.java:385)
      	at org.apache.hadoop.hive.ql.exec.MapJoinOperator.loadHashTable(MapJoinOperator.java:454)
      	at org.apache.hadoop.hive.ql.exec.MapJoinOperator.lambda$initializeOp$0(MapJoinOperator.java:238)
      	at org.apache.hadoop.hive.ql.exec.MapJoinOperator$$Lambda$38/1790423335.call(Unknown Source)
      	at org.apache.hadoop.hive.ql.exec.tez.ObjectCache.retrieve(ObjectCache.java:96)
      	at org.apache.hadoop.hive.ql.exec.tez.ObjectCache$1.call(ObjectCache.java:113)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	... 3 more
      ]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:517, Vertex vertex_1686687930454_0003_1_10 [Reducer 3] killed/failed due to:OWN_TASK_FAILURE]Vertex killed, vertexName=Reducer 4, vertexId=vertex_1686687930454_0003_1_11, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:2, Vertex vertex_1686687930454_0003_1_11 [Reducer 4] killed/failed due to:OTHER_VERTEX_FAILURE]Vertex killed, vertexName=Reducer 5, vertexId=vertex_1686687930454_0003_1_12, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_1686687930454_0003_1_12 [Reducer 5] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:2 (state=08S01,code=2)
      Closing: 0: jdbc:hive2://vc0801.halxg.cloudera.com:2181,vc0824.halxg.cloudera.com:2181,vc0922.halxg.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;sslTrustStore=/root/hive_truststore.jks;trustStorePassword=9mMcEG04Ao1DGtG1Gt2AmDZBWYRwKkiU1EaL7kJvJvy;principal=hive/vc0801.halxg.cloudera.com@HALXG.CLOUDERA.COM;ssl=true
      

      the problem: even if there are thing like "diagnostics", I cannot see on which node the actual task attempt ran, so I need to dig into the logs, however I know there is -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp configured for the tasks, so with the node info, I can easily go to the node directly and look at /tmp

      I guess we need to look around here: https://github.com/apache/tez/blob/5beab4ced9d07bc813a8d79ded111b72af5a2f02/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskReporter.java#L397

      Attachments

        Issue Links

          Activity

            People

              ayushtkn Ayush Saxena
              abstractdog László Bodor
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m