Description
In Tez local mode, when a DAG is killed, a DeallocationTaskRequest may be handled before its corresponding AllocationTaskRequest. In that case the TaskRequest is not actually deallocated, because the AllocationTaskRequest is handled after the DeallocationTaskRequest. In local session mode, the DAG is killed but its TaskRequest remains and goes on to launch the task attempt. The task attempt then starts heartbeating with the AM, while the AM has already started a new DAG. Because the task attempt's own DAG has been killed, it heartbeats against the wrong DAG, which causes the following exception:
15:38:24,208 - Thread(TaskHeartbeatThread) - (TezTaskRunner.java:333) - TaskReporter reported error
java.lang.NullPointerException
    at org.apache.tez.dag.app.TaskAttemptListenerImpTezDag.heartbeat(TaskAttemptListenerImpTezDag.java:514)
    at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.heartbeat(TaskReporter.java:249)
    at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:176)
    at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
This error in turn causes the TezChild thread to be interrupted:
16:04:26,718 - Thread(TezChild) - (TezTaskRunner.java:221) - Encounted an error while executing task: attempt_1416384252992_0001_2_00_000000_0
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
    at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
    at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.initialize(LogicalIOProcessorRuntimeTask.java:211)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:173)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
This issue causes TestExceptionPropagation to time out occasionally, especially on Windows.
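
The ordering problem described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Tez code and not necessarily how the issue was or should be fixed: the class RequestOrderingSketch, its Request type, and the "remember early deallocations" guard are all hypothetical names and choices made for illustration. It shows that a handler which treats a deallocation for an unknown task as a no-op will still launch the task when the late allocation arrives, whereas a handler that remembers the early deallocation can skip the launch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of a request-handling loop; not Tez code.
final class RequestOrderingSketch {

    enum Kind { ALLOCATE, DEALLOCATE }

    static final class Request {
        final Kind kind;
        final String taskId;

        Request(Kind kind, String taskId) {
            this.kind = kind;
            this.taskId = taskId;
        }
    }

    // Naive handling: a deallocation for a task that is not yet known is
    // silently ignored, so an allocation handled afterwards still launches it.
    // Returns the ids of the tasks that were launched, in launch order.
    static List<String> launchedNaive(List<Request> requests) {
        Set<String> allocated = new HashSet<>();
        List<String> launched = new ArrayList<>();
        for (Request r : requests) {
            if (r.kind == Kind.DEALLOCATE) {
                allocated.remove(r.taskId); // no-op if the task is unknown
            } else if (allocated.add(r.taskId)) {
                launched.add(r.taskId);
            }
        }
        return launched;
    }

    // Guarded handling (an assumed mitigation): remember deallocations that
    // arrive before their allocation, and drop the matching allocation later.
    static List<String> launchedGuarded(List<Request> requests) {
        Set<String> allocated = new HashSet<>();
        Set<String> deallocatedEarly = new HashSet<>();
        List<String> launched = new ArrayList<>();
        for (Request r : requests) {
            if (r.kind == Kind.DEALLOCATE) {
                if (!allocated.remove(r.taskId)) {
                    deallocatedEarly.add(r.taskId); // allocation not seen yet
                }
            } else if (deallocatedEarly.remove(r.taskId)) {
                // The task was already deallocated; do not launch it.
            } else if (allocated.add(r.taskId)) {
                launched.add(r.taskId);
            }
        }
        return launched;
    }
}
```

With the sequence DEALLOCATE(t1), ALLOCATE(t1), launchedNaive still launches t1, which corresponds to the stray task attempt in this report, while launchedGuarded launches nothing. When the requests arrive in the normal order, both variants behave the same.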