  1. Apache Tez
  2. TEZ-1790

DeallocationTaskRequest may be handled before the corresponding AllocationTaskRequest in local mode


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.0
    • Component/s: None
    • Labels: None

    Description

      In Tez local mode, when a DAG is killed, a DeallocationTaskRequest may be handled before the corresponding AllocationTaskRequest. In that case the TaskRequest is not actually deallocated, because the AllocationTaskRequest is handled after the DeallocationTaskRequest. In local session mode, the DAG is killed but its TaskRequest is still there and goes on to launch the task attempt. The task attempt then starts heartbeating with the AM, which has meanwhile started a new DAG, so the attempt is heartbeating against the wrong DAG (its own DAG has been killed). This causes the following exception:

      15:38:24,208 - Thread(TaskHeartbeatThread) - (TezTaskRunner.java:333) - TaskReporter reported error
      java.lang.NullPointerException
      	at org.apache.tez.dag.app.TaskAttemptListenerImpTezDag.heartbeat(TaskAttemptListenerImpTezDag.java:514)
      	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.heartbeat(TaskReporter.java:249)
      	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:176)
      	at org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call(TaskReporter.java:118)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      

      This error causes TezChild to be interrupted:

      16:04:26,718 - Thread(TezChild) - (TezTaskRunner.java:221) - Encounted an error while executing task: attempt_1416384252992_0001_2_00_000000_0
      java.lang.InterruptedException
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
      	at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
      	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:439)
      	at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:193)
      	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.initialize(LogicalIOProcessorRuntimeTask.java:211)
      	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:173)
      	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)
      	at java.security.AccessController.doPrivileged(Native Method)
      	at javax.security.auth.Subject.doAs(Subject.java:415)
      	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
      	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:168)
      	at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:163)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      

      This issue sometimes causes TestExceptionPropagation to time out, especially on Windows.
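
      The ordering problem described above can be illustrated with a minimal sketch. It is not the attached patch and does not use the real Tez classes: LocalSchedulerSketch, handleAllocation, handleDeallocation and the string task IDs are all hypothetical names chosen for illustration. The idea shown is one possible guard, assuming requests are handled one at a time: if a deallocation arrives for a task whose allocation has not been handled yet, remember it so that the late allocation is dropped instead of launching a task attempt for a killed DAG.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a local-mode task scheduler; not the actual Tez code.
final class LocalSchedulerSketch {
    // Tasks whose allocation has been handled and not yet deallocated.
    private final Set<String> allocated = new HashSet<>();
    // Tasks whose deallocation was handled before their allocation.
    private final Set<String> cancelledBeforeAllocation = new HashSet<>();
    // Every task that was actually launched, in launch order.
    private final List<String> launched = new ArrayList<>();

    synchronized void handleDeallocation(String taskId) {
        if (!allocated.remove(taskId)) {
            // The allocation has not been handled yet: record the cancellation
            // so the allocation is ignored when it finally arrives.
            cancelledBeforeAllocation.add(taskId);
        }
    }

    synchronized void handleAllocation(String taskId) {
        if (cancelledBeforeAllocation.remove(taskId)) {
            // The task was already deallocated (for example, its DAG was killed),
            // so do not launch a task attempt for it.
            return;
        }
        allocated.add(taskId);
        launched.add(taskId);
    }

    // Returns a copy so callers cannot modify internal state.
    synchronized List<String> launchedTasks() {
        return new ArrayList<>(launched);
    }
}
```

      With this guard, calling handleDeallocation("task-1") followed by handleAllocation("task-1") launches nothing, whereas without it the late allocation would still launch the attempt, which is the behavior reported here.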

      Attachments

        1. TEZ-1790-5.patch
          11 kB
          Jeff Zhang
        2. TEZ-1790-4.patch
          11 kB
          Jeff Zhang
        3. TEZ-1790-3.patch
          10 kB
          Jeff Zhang
        4. TEZ-1790-2.patch
          10 kB
          Jeff Zhang
        5. TEZ-1790.patch
          2 kB
          Jeff Zhang


          People

            Assignee: Jeff Zhang (zjffdu)
            Reporter: Jeff Zhang (zjffdu)