  1. Apache Tez
  2. TEZ-284

Add 'terminationCause' tracking to DAGImpl and VertexImpl

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      By tracking a reasonably exact cause of termination, we can be more precise with state-machine logic and facilitate post-mortems.
      For example, DAGImpl and VertexImpl use a checkXYZforCompletion() method to determine whether to transition to a final state. In non-success cases, the root cause determines if we transition to FAILED or KILLED.

      This helps implement TEZ-141 "DAG does not kill running vertices when going into failed state" and TEZ-143 "Vertex does not kill other running tasks when it fails due to a task failure". (Simpler solutions for the state-machine issues are available, but general tracking of root causes seems valuable for its post-mortem uses.)

      The initial improvement is to get general support in place; later JIRAs will add more diagnostics support and additional 'root causes' as necessary.

      Example:
      public enum VertexTerminationCause {

          /** DAG was killed. */
          DAG_KILL,

          /** Another vertex failed, causing the DAG to fail and thus killing this vertex. */
          OTHER_VERTEX_FAILURE,

          /** One of the tasks for this vertex failed. */
          OWN_TASK_FAILURE,

          /** This vertex failed during commit. */
          COMMIT_FAILURE,

          /** This vertex failed as it had zero tasks. */
          ZERO_TASKS,

          /** This vertex failed during init. */
          GENERIC_INIT_FAILURE
      }
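
      To illustrate how a recorded cause could drive the FAILED versus KILLED decision described above, here is a minimal sketch. It is not the actual DAGImpl or VertexImpl code: the class name TerminationCauseSketch, the FinalState enum, the finalStateFor() helper, and the convention that a null cause means success are all assumptions made for illustration. The mapping simply follows the wording of the enum comments (causes phrased as "killed" map to KILLED, causes phrased as "failed" map to FAILED).

```java
final class TerminationCauseSketch {

    /** Causes copied from the example enum in this issue. */
    enum VertexTerminationCause {
        DAG_KILL,
        OTHER_VERTEX_FAILURE,
        OWN_TASK_FAILURE,
        COMMIT_FAILURE,
        ZERO_TASKS,
        GENERIC_INIT_FAILURE
    }

    /** Hypothetical final states; the real state enums may differ. */
    enum FinalState { SUCCEEDED, FAILED, KILLED }

    /**
     * Hypothetical helper: picks a final state from the recorded cause.
     * Assumption: a null cause means nothing went wrong.
     */
    static FinalState finalStateFor(VertexTerminationCause cause) {
        if (cause == null) {
            return FinalState.SUCCEEDED;
        }
        switch (cause) {
            // The vertex was stopped from outside, so it is reported as killed.
            case DAG_KILL:
            case OTHER_VERTEX_FAILURE:
                return FinalState.KILLED;
            // The vertex itself went wrong, so it is reported as failed.
            case OWN_TASK_FAILURE:
            case COMMIT_FAILURE:
            case ZERO_TASKS:
            case GENERIC_INIT_FAILURE:
                return FinalState.FAILED;
            default:
                throw new IllegalStateException("Unhandled cause: " + cause);
        }
    }

    public static void main(String[] args) {
        // Print the assumed mapping for every cause.
        for (VertexTerminationCause cause : VertexTerminationCause.values()) {
            System.out.println(cause + " -> " + finalStateFor(cause));
        }
    }
}
```

      Keeping the mapping in one place means a checkXYZforCompletion()-style method would only need to consult the stored cause, and new causes added by later JIRAs would be forced through the same switch.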

              People

              • Assignee: Mike Liddell (mikeliddell)
              • Reporter: Mike Liddell (mikeliddell)
