Apache Tez / TEZ-15

Support for DAG AM recovery

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Sub-Tasks

        1. Generate data to be used for recovery | Sub-task | Closed | Hitesh Shah
        2. Support basic AM recovery | Sub-task | Closed | Hitesh Shah
        3. Support recovery with edge managers modified at run-time | Sub-task | Resolved | Hitesh Shah
        4. Support committer-supported task recovery | Sub-task | Resolved | Hitesh Shah
        5. Committer recovery events should be out-of-band | Sub-task | Closed | Hitesh Shah
        6. Handle failure to persist events to HDFS | Sub-task | Closed | Hitesh Shah
        7. Recovery unit tests | Sub-task | Closed | Jeff Zhang
        8. Handle correct recovery for committers with VertexGroups | Sub-task | Resolved | Unassigned
        9. Handle zero task vertices correctly on Recovery | Sub-task | Closed | Hitesh Shah
        10. Implement more optimal flush/sync mechanism to HDFS | Sub-task | Resolved | Unassigned
        11. Handle restore of AMContainer and AMNode states on recovery | Sub-task | Resolved | Unassigned
        12. Handle task re-schedules in recovery | Sub-task | Resolved | Unassigned
        13. Support counters recovery | Sub-task | Closed | Jeff Zhang
        14. Fix non-recovery code path for sessions | Sub-task | Closed | Hitesh Shah
        15. Handle Session Tokens for Recovery | Sub-task | Closed | Hitesh Shah
        16. Recovery data should not be written on AsyncDispatcher thread | Sub-task | Resolved | Jeff Zhang
        17. Remove application logic from RecoveryService | Sub-task | Resolved | Jeff Zhang
        18. TestDAGRecover/2 sometimes hang | Sub-task | Closed | Hitesh Shah
        19. Fix determination of failed attempts in recovery | Sub-task | Closed | Jeff Zhang
        20. Restore dagName Set for duplicate detection in recovered AMs. | Sub-task | Closed | Jeff Zhang
        21. setParallelism in recovery does not send event to downstream vertices | Sub-task | Resolved | Unassigned
        22. Disable multiple AM attempts if recovery is disabled. | Sub-task | Closed | Hitesh Shah
        23. Fix handling of corrupt or empty files in recovery data | Sub-task | Closed | Hitesh Shah
        24. Re-factor routing of events to use common code path for normal and recovery flow. | Sub-task | Resolved | Jeff Zhang
        25. Use hflush instead of hsync in recovery log | Sub-task | Closed | Hitesh Shah
        26. RecoveryParser can find incorrect last DAG ID | Sub-task | Closed | Jeff Zhang
        27. AMStartedEvent should not be recovery event | Sub-task | Closed | Jeff Zhang
        28. Rename AMLaunchedEvent to AMInitializedEvent | Sub-task | Resolved | Jeff Zhang
        29. Add checks to guarantee all init events are written to recovery to consider vertex initialized | Sub-task | Closed | Jeff Zhang
        30. Restore successfulAttempt from TaskFinishedEvent instead of TaskAttemptFinishedEvent | Sub-task | Open | Jeff Zhang
        31. Move recovery related code into inner class | Sub-task | Resolved | Jeff Zhang
        32. Recovery fails due to TaskAttemptFinishedEvent being recorded multiple times for the same task | Sub-task | Closed | Jeff Zhang
        33. VertexDataMovementEventsGeneratedEvent may be logged twice in recovery log | Sub-task | Closed | Jeff Zhang
        34. Add system tests for AM recovery | Sub-task | Closed | Jeff Zhang
        35. Add tests for checking custom vertex managers like auto-reduce parallelism in recovery | Sub-task | Resolved | Jeff Zhang
        36. Restore counters from DAGFinishedEvent when DAG is completed | Sub-task | Closed | Jeff Zhang
        37. Vertex should always been recovered to FAIL when DAG is committing | Sub-task | Resolved | Jeff Zhang
        38. Remove need to copy over all events from attempt 1 to attempt 2 dir | Sub-task | Closed | Jeff Zhang
        39. Recovery failure in the case of Auto-reduce parallelism | Sub-task | Resolved | Jeff Zhang
        40. Add test for RecoveryEvent Spec | Sub-task | Resolved | Jeff Zhang
        41. Recovery of task events (eg. datamovement events) should not depend on ordering of task attempt events | Sub-task | Resolved | Unassigned
        42. Refactor recovery event logging to ensure it meet the recovery event spec | Sub-task | Resolved | Jeff Zhang
        43. Session stats should be recovered | Sub-task | Resolved | Jeff Zhang
        44. Incorrect dag result due to wrong TaskSpec in recovering | Sub-task | Resolved | Jeff Zhang

          People

            Assignee: Unassigned
            Reporter: Bikas Saha
            Votes: 0
            Watchers: 6