Uploaded image for project: 'Apache Tez'
  1. Apache Tez
  2. TEZ-2534

Error handling summary event when shutting down AM

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.8.0-alpha
    • 0.6.2, 0.8.0-alpha, 0.7.1
    • None
    • None

    Description

      When AM is shutting down, it will close the summary stream, but there may be still some events in the queue which will cause exception when handling summary event. And this would cause the next AM fail to recover the running dag.
      One way to resolve this issue is to always drain the events in the queue before closing the summary stream (set drainEventsFlag as true), but this flag may be useful in unit test.

      2015-06-03 16:37:15,761 INFO [Thread-1] app.DAGAppMaster: DAGAppMasterShutdownHook invoked
      2015-06-03 16:37:15,761 INFO [Thread-1] app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
      2015-06-03 16:37:15,761 INFO [Thread-1] rm.TaskSchedulerEventHandler: TaskScheduler notified that iSignalled was : true
      2015-06-03 16:37:15,762 INFO [Thread-1] history.HistoryEventHandler: Stopping HistoryEventHandler
      2015-06-03 16:37:15,762 INFO [Thread-1] recovery.RecoveryService: Stopping RecoveryService
      2015-06-03 16:37:15,762 INFO [Thread-1] recovery.RecoveryService: Closing Summary Stream
      2015-06-03 16:37:15,772 INFO [Thread-1] recovery.RecoveryService: Closing Output Stream for DAG dag_1433320263267_0019_1
      2015-06-03 16:37:15,773 ERROR [Dispatcher thread: Central] recovery.RecoveryService: Error handling summary event, eventType=VERTEX_FINISHED
      java.nio.channels.ClosedChannelException
      	at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1622)
      	at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:104)
      	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
      	at java.io.DataOutputStream.write(DataOutputStream.java:107)
      	at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
      	at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
      	at com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:91)
      	at org.apache.tez.dag.history.events.VertexFinishedEvent.toSummaryProtoStream(VertexFinishedEvent.java:207)
      	at org.apache.tez.dag.history.recovery.RecoveryService.handleSummaryEvent(RecoveryService.java:373)
      	at org.apache.tez.dag.history.recovery.RecoveryService.handle(RecoveryService.java:285)
      	at org.apache.tez.dag.history.HistoryEventHandler.handleCriticalEvent(HistoryEventHandler.java:105)
      	at org.apache.tez.dag.app.dag.impl.VertexImpl.logJobHistoryVertexCompletedHelper(VertexImpl.java:1890)
      	at org.apache.tez.dag.app.dag.impl.VertexImpl.logJobHistoryVertexFinishedEvent(VertexImpl.java:1869)
      	at org.apache.tez.dag.app.dag.impl.VertexImpl.finished(VertexImpl.java:2107)
      	at org.apache.tez.dag.app.dag.impl.VertexImpl.finished(VertexImpl.java:2125)
      	at org.apache.tez.dag.app.dag.impl.VertexImpl.checkTasksForCompletion(VertexImpl.java:1989)
      	at org.apache.tez.dag.app.dag.impl.VertexImpl$TaskCompletedTransition.transition(VertexImpl.java:3833)
      	at org.apache.tez.dag.app.dag.impl.VertexImpl$TaskCompletedTransition.transition(VertexImpl.java:1)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
      	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
      	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
      	at org.apache.tez.state.StateMachineTez.doTransition(StateMachineTez.java:57)
      	at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1799)
      	at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1)
      	at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1954)
      	at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1)
      	at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
      	at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114)
      	at java.lang.Thread.run(Thread.java:745)
      2015-06-03 16:37:15,775 ERROR [Dispatcher thread: Central] recovery.RecoveryService: Adding a flag to ensure next AM attempt does not start up, flagFile=hdfs://localhost:58857/tmp/owc-staging-dir/.tez/application_1433320263267_0019/recovery/1/RecoveryFatalErrorOccurred
      2015-06-03 16:37:15,781 ERROR [Dispatcher thread: Central] recovery.RecoveryService: Recovery failure occurred. Skipping all events
      2015-06-03 16:37:15,781 INFO [HistoryEventHandlingThread] impl.SimpleHistoryLoggingService: Writing event VERTEX_FINISHED to history file
      

      Attachments

        1. TEZ-2534-1.patch
          2 kB
          Jeff Zhang

        Activity

          People

            zjffdu Jeff Zhang
            zjffdu Jeff Zhang
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: