Apache Tez / TEZ-15 Support for DAG AM recovery / TEZ-2456

Refactor recovery event logging to ensure it meets the recovery event spec

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix

    Description

      Currently there is no spec for recovery event logging, which makes recovery fragile when the code changes. This JIRA tries to define the spec and to refactor the recovery event logging to ensure it meets that spec. Hitesh Shah, please help review the following draft spec.

      DAG

      • DAGSubmittedEvent / DAGInitializedEvent / DAGStartedEvent must be logged only once; they should not be logged again when the DAG is recovered.
      • DAGFinishedEvent may be logged multiple times. (DAG moves from SUCCEEDED to ERROR? Should we ignore this?)
      • VertexFinishedEvent should be logged before DAGFinishedEvent.
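
      The log-once rule for DAG events could be checked against a recorded event sequence roughly as follows. This is only a sketch: the class DagLogOnceCheck and the modelling of events as plain type-name strings are hypothetical and not part of Tez, and the log is assumed to contain the events of a single DAG.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical checker, not part of Tez: each logged event is just its type name.
class DagLogOnceCheck {
    // Event types that the draft spec says must be logged only once per DAG.
    static final Set<String> LOG_ONCE = Set.of(
        "DAGSubmittedEvent", "DAGInitializedEvent", "DAGStartedEvent");

    // Returns the first log-once type that appears a second time, or null if none does.
    // DAGFinishedEvent is not in LOG_ONCE, so repeats of it are accepted.
    static String firstDuplicate(List<String> log) {
        Set<String> seen = new HashSet<>();
        for (String type : log) {
            if (LOG_ONCE.contains(type) && !seen.add(type)) {
                return type;
            }
        }
        return null;
    }
}
```

      A non-null result would point at an event that was logged again, for example after a recovery.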

      Vertex

      • RootInputDataInformation must be logged before VertexInitializedEvent.
      • DataMovement must be logged before TaskFinishedEvent.
      • TaskFinishedEvent must be logged before VertexFinishedEvent.
      • VertexInitializedEvent / VertexStartedEvent should be logged only once; they should not be logged again when the vertex is recovered.
      • VertexFinishedEvent may be logged multiple times (e.g. the vertex moves from SUCCEEDED to FAILED).
      • VertexParallelismUpdatedEvent must be logged before TaskStartedEvent.
      • A VertexFinishedEvent (SUCCEEDED) must be preceded by at least n TaskFinishedEvent (SUCCEEDED).
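
      The "must be logged before" rules in the Vertex list can be read as a small set of ordered pairs. The sketch below checks them over a list of event type names. VertexOrderCheck is a hypothetical helper, not Tez code, and it makes two simplifying assumptions: a pair is only checked when both events occur, and only the first occurrence of each event is compared.

```java
import java.util.List;

// Hypothetical checker, not part of Tez: each logged event is just its type name.
class VertexOrderCheck {
    // {before, after} pairs taken from the Vertex rules of the draft spec.
    static final String[][] ORDER = {
        {"RootInputDataInformation", "VertexInitializedEvent"},
        {"DataMovement", "TaskFinishedEvent"},
        {"TaskFinishedEvent", "VertexFinishedEvent"},
        {"VertexParallelismUpdatedEvent", "TaskStartedEvent"},
    };

    // True if, for every pair whose two events both occur, the first occurrence
    // of the "before" event precedes the first occurrence of the "after" event.
    static boolean ordered(List<String> log) {
        for (String[] pair : ORDER) {
            int beforeIdx = log.indexOf(pair[0]);
            int afterIdx = log.indexOf(pair[1]);
            if (beforeIdx >= 0 && afterIdx >= 0 && beforeIdx > afterIdx) {
                return false;
            }
        }
        return true;
    }
}
```

      A stricter check would compare every occurrence per task, which needs task IDs that this sketch leaves out.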

      Task

      • If there is no TaskStartedEvent, TaskFinishedEvent may still be logged (e.g. the task is killed in NEW). The current behavior is that TaskFinishedEvent is not logged if there is no TaskStartedEvent.
      • TaskStartedEvent should be logged only once; it should not be logged again when the task is recovered.
      • TaskFinishedEvent may be logged multiple times (e.g. the task moves from SUCCEEDED to FAILED).
      • A TaskFinishedEvent (SUCCEEDED) must be preceded by at least one TaskAttemptFinishedEvent (SUCCEEDED).
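
      The last Task rule ties a successful task to a successful attempt. One way to express it, assuming the log holds the events of a single task and using a hypothetical Event type with a type name and a final state (neither is the real Tez event class), is:

```java
import java.util.List;

// Hypothetical checker, not part of Tez.
class TaskSuccessCheck {
    // Minimal stand-in for a logged event: type name plus final state
    // (state is null for events that carry none, such as started events).
    static final class Event {
        final String type;
        final String state;
        Event(String type, String state) {
            this.type = type;
            this.state = state;
        }
    }

    // True if every TaskFinishedEvent(SUCCEEDED) is preceded by at least one
    // TaskAttemptFinishedEvent(SUCCEEDED) in the same log.
    static boolean succeededTaskHasSucceededAttempt(List<Event> log) {
        boolean attemptSucceeded = false;
        for (Event e : log) {
            if (!"SUCCEEDED".equals(e.state)) {
                continue; // only SUCCEEDED events matter for this rule
            }
            if (e.type.equals("TaskAttemptFinishedEvent")) {
                attemptSucceeded = true;
            } else if (e.type.equals("TaskFinishedEvent") && !attemptSucceeded) {
                return false;
            }
        }
        return true;
    }
}
```

      The analogous Vertex rule (at least n succeeded tasks) could use a counter in place of the boolean.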

      TaskAttempt

      • If there is no TaskAttemptStartedEvent, TaskAttemptFinishedEvent may still be logged (e.g. the task attempt is killed in NEW). The current behavior is that TaskAttemptFinishedEvent is not logged if there is no TaskAttemptStartedEvent.
      • TaskAttemptStartedEvent should be logged only once; it should not be logged again when the task attempt is recovered.
      • TaskAttemptFinishedEvent may be logged multiple times (e.g. the task attempt moves from SUCCEEDED to FAILED).
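
      On the recovery side, these rules mean a replay has to tolerate a finished event with no started event and more than one finished event. The sketch below folds one attempt's events into a recovered state. AttemptReplay, its State enum and its Event type are hypothetical, and letting the last finished event win is an assumption made for illustration; the spec above does not state it.

```java
import java.util.List;

// Hypothetical replay sketch, not the Tez recovery implementation.
class AttemptReplay {
    enum State { NEW, RUNNING, SUCCEEDED, FAILED, KILLED }

    // Minimal stand-in for a logged event; finalState is null for started events.
    static final class Event {
        final String type;
        final State finalState;
        Event(String type, State finalState) {
            this.type = type;
            this.finalState = finalState;
        }
    }

    // Replays one attempt's events in log order and returns the recovered state.
    static State recover(List<Event> log) {
        State state = State.NEW;
        for (Event e : log) {
            if (e.type.equals("TaskAttemptStartedEvent")) {
                // Only NEW -> RUNNING; a started event never undoes a finished state.
                if (state == State.NEW) {
                    state = State.RUNNING;
                }
            } else if (e.type.equals("TaskAttemptFinishedEvent")) {
                // Accepted even without a started event; a later finished
                // event overrides an earlier one (assumption).
                state = e.finalState;
            }
        }
        return state;
    }
}
```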


          People

            Assignee: Jeff Zhang (zjffdu)
            Reporter: Jeff Zhang (zjffdu)