  1. Apache Tez
  2. TEZ-3972

Tez DAG can hang when a single task fails to fetch

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.1
    • Fix Version/s: 0.9.2, 0.10.0
    • Component/s: None
    • Labels: None

    Description

      Description of the hung DAG:
      A DAG has two vertices: the Map vertex has 22k tasks and the downstream Reduce vertex has 1009 tasks. All tasks succeed except one, which hangs. That one task attempt is doing a local fetch from a node that now has a bad disk. The fetch fails, and the attempt reports the offending input attempt identifiers to the AM. However, the AM does not schedule a re-run, because uniquefailedOutputReports has size 1 (only this one task attempt failed to fetch) and the failure fraction threshold is therefore not met. The denominator of this fraction is the total number of tasks, so with a single reporter the threshold is never reached and the re-run never occurs. This JIRA tracks the AM-side change to alleviate the problem.
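
      The arithmetic behind the hang can be sketched roughly as follows. This is an illustration only, not the Tez AM code: the class name, method name, and the 0.1 threshold are assumptions made for the example, and 1009 is used as the denominator on the assumption that "total number of tasks" refers to the Reduce vertex. Only the idea of dividing the number of unique failure reporters by the total task count comes from the description above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; these names do not come from the Tez code base.
final class FetchFailureSketch {

    // Returns true when the share of distinct consumers reporting a fetch
    // failure for the same output reaches the configured fraction.
    static boolean shouldRerunProducer(Set<String> uniqueFailedReporters,
                                       int totalTasks,
                                       double maxFailureFraction) {
        double failureFraction = (double) uniqueFailedReporters.size() / totalTasks;
        return failureFraction >= maxFailureFraction;
    }

    public static void main(String[] args) {
        Set<String> reporters = new HashSet<>();
        reporters.add("reduce_attempt_0"); // the single hung consumer attempt (made-up id)

        // 1 / 1009 is roughly 0.001, far below the assumed threshold of 0.1,
        // so the producer output is never regenerated.
        boolean rerun = shouldRerunProducer(reporters, 1009, 0.1);
        System.out.println("re-run producer: " + rerun);
    }
}
```

      With only one reporter, the check stays false however many times that attempt retries the fetch, which matches the hang described above.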

      Attachments

        1. TEZ-3972.003.patch
          10 kB
          Kuhu Shukla
        2. TEZ-3972.002.patch
          9 kB
          Kuhu Shukla
        3. TEZ-3972.001.patch
          10 kB
          Kuhu Shukla

            People

              Assignee: kshukla Kuhu Shukla
              Reporter: kshukla Kuhu Shukla
              Votes: 0
              Watchers: 4
