Apache Tez / TEZ-4233

Map task should be blamed earlier for local fetch failures

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.1
    • Component/s: None
    • Labels: None

    Description

      Fetch failures can be the result of a network issue or a disk issue. Currently, the AM doesn't know whether the original input read error was caused by a local fetch failure or not. I think that if a map output is reported as the subject of a local fetch failure, the AM should respond earlier and blame the map task as soon as possible. The hidden assumption here is that a disk read should never fail (or should fail only rarely compared to network issues).

      I detected this issue in a Kubernetes-based LLAP environment, where a daemon completely disappeared and a new daemon, running reducer tasks, assumed that it had the map outputs locally, which wasn't the case.
      This patch can help in container mode as well: we can assume that a local read should work, and if it doesn't, the original map output data should be regenerated as soon as possible.
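
The idea above can be illustrated with a minimal sketch. This is not the code from the attached patches and does not use any Tez API: the class `FetchFailurePolicy`, the method `shouldRerunMapTask`, and the threshold parameter are all hypothetical names chosen for illustration. It assumes the AM receives, for each read error, the identifier of the map output and a flag saying whether the failed fetch was local.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not Tez code): decides when a map task should be
 * blamed and re-run, based on fetch failure reports from reducers.
 */
class FetchFailurePolicy {
    // Assumed setting: how many remote (network) fetch failures are
    // tolerated for one map output before the map task is blamed.
    private final int remoteFailureThreshold;

    // Number of failure reports seen so far, keyed by map output id.
    private final Map<String, Integer> failureCounts = new HashMap<>();

    FetchFailurePolicy(int remoteFailureThreshold) {
        this.remoteFailureThreshold = remoteFailureThreshold;
    }

    /**
     * Records one failed read of a map output and returns true when the
     * producing map task should be re-run.
     */
    boolean shouldRerunMapTask(String mapOutputId, boolean localFetch) {
        int count = failureCounts.merge(mapOutputId, 1, Integer::sum);
        if (localFetch) {
            // A local disk read is assumed to (almost) never fail, so a
            // single local failure is enough to blame the map task.
            return true;
        }
        // Network failures may be transient, so wait for the threshold.
        return count >= remoteFailureThreshold;
    }
}
```

With a threshold of 3, a first remote failure for an output returns `false`, while a single local failure for any output returns `true` immediately, which is the "blame earlier" behavior described above.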

      Attachments

        1. TEZ-4233.01.patch
          22 kB
          László Bodor
        2. TEZ-4233.02.patch
          69 kB
          László Bodor
        3. TEZ-4233.03.patch
          70 kB
          László Bodor
        4. TEZ-4233.04.patch
          95 kB
          László Bodor
        5. TEZ-4233.05.patch
          95 kB
          László Bodor

            People

              Assignee: abstractdog László Bodor
              Reporter: abstractdog László Bodor