TEZ-3198 (Apache Tez)

Shuffle failures for the trailing task in a vertex are often fatal to the entire DAG

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 0.7.0, 0.8.2

    Description

      I've seen an increasing number of cases where a single-node failure caused the whole Tez DAG to fail. What these scenarios have in common is that the last task of a vertex is attempting to complete its shuffle after all of its peer tasks have already finished shuffling. The last task's attempt encounters errors shuffling one of its inputs and keeps reporting them to the AM. Eventually the attempt decides that it must itself be the cause of the shuffle errors and fails. The subsequent attempts all do the same thing, until the task hits its max attempts limit and the vertex and DAG fail.
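
      The failure pattern described above can be illustrated with a toy simulation. This is only a sketch and not Tez code: the `ToyLimits` fields, the `simulate` function, and all threshold values are hypothetical. In particular, the rule that the AM re-runs the producer only after hearing from several distinct consumers is an assumption made for illustration; the report itself only says that the lone trailing task keeps reporting errors until it blames itself.

```python
from dataclasses import dataclass


@dataclass
class ToyLimits:
    """Hypothetical knobs for this toy model; not real Tez configuration keys."""
    max_task_attempts: int = 4           # attempts allowed before the task fails for good
    failures_before_self_blame: int = 5  # fetch failures after which an attempt fails itself
    reporters_needed_for_rerun: int = 2  # ASSUMPTION: distinct consumers the AM wants to hear from


def simulate(limits: ToyLimits, consumers_still_shuffling: int) -> str:
    """Model consumers fetching from one producer whose output was lost with a node."""
    producer_output_available = False  # the single failed node took the output with it
    for attempt in range(1, limits.max_task_attempts + 1):
        failures = 0
        while failures < limits.failures_before_self_blame:
            if producer_output_available:
                return f"SUCCEEDED on attempt {attempt}"
            failures += 1  # fetch failed; the attempt reports the failure to the AM
            # ASSUMPTION: the AM re-runs the producer only when enough
            # distinct consumers are complaining about the same input.
            if consumers_still_shuffling >= limits.reporters_needed_for_rerun:
                producer_output_available = True
        # The attempt concludes that it is the problem and fails itself.
    return "DAG_FAILED"


if __name__ == "__main__":
    print(simulate(ToyLimits(), consumers_still_shuffling=1))  # lone trailing task
    print(simulate(ToyLimits(), consumers_still_shuffling=3))  # peers still shuffling
```

      Under these assumptions, a lone trailing task exhausts every attempt without the lost input ever being regenerated, while the same failure with several consumers still shuffling is recovered on the first attempt.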

            People

              Assignee: Unassigned
              Reporter: Jason Darrell Lowe (jlowe)
              Votes: 0
              Watchers: 13
