Uploaded image for project: 'Apache Tez'
  1. Apache Tez
  2. TEZ-3879

potential abort propagation issue (race?)

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      I'm looking at a Hive LLAP query where AM aborts some tasks for whatever reason (AM preemption).
      On the nodes, the abort is handled by TezTaskRunner2 and it looks like there's some race there for some cases.
      Most tasks receive abort normally, like so (the first thing Hive TezProcessor does on any abort is log "Received abort").

      2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort attempt_1513367667720_3619_1_02_000012_0 due to an invocation of shutdownRequested
      2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
      2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to RecordProcessor
      2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to mapOp: {} MAP
      

      However on some tasks that are terminated shortly after init, TezProcessor is never called. Moreover, when AM tries to kill the task again (when it's already running, having ignored the abort) Tez says the task is already aborted and doesn't propagate this either.

      2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 (1513367667720_3619_3_02_000012_0)] reducesink.VectorReduceSinkCommonOperator: Using tag = -1
      (this is the end of Hive init)
      ...
      2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort attempt_1513367667720_3619_3_02_000012_0 due to an invocation of shutdownRequested
      (no TezProcessor log statements)
      

      The task keeps running and the next kill is ignored

      2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received terminateFragment request for attempt_1513367667720_3619_3_02_000012_0
      ...
      2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since the task with id attempt_1513367667720_3619_3_02_000012_0 has ended for reason: CONTAINER_STOP_REQUESTED. IgnoredError:  
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            sershe Sergey Shelukhin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: