  1. Apache Tez
  2. TEZ-965

Tez needs a "circuit-breaker" to avoid mistaking network blips for task/node failures

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Environment: Flaky DNS cluster

    Description

      If DNS resolution fails for 5-10 seconds, Tez restarts tasks and contra-flows through the query, triggering recovery of nearly everything it has already run.

      Nodes get marked as bad because they can't shuffle (DNS resolution failed for all NMs), which results in log lines like:

      attempt_1394928384313_0234_1_25_000654_0 blamed for read error from attempt_1394928384313_0234_1_24_000366_0

      And the tasks restart from an earlier vertex.

      When a large number of such failures happen, the tasks shouldn't restart previous vertices; instead they should trip a circuit breaker and back off until the network blip disappears.
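
To make the request concrete, here is a minimal sketch of what such a breaker could look like. It is not existing Tez code: the class name `FetchFailureCircuitBreaker`, its methods, and the threshold, window, and back-off parameters are all hypothetical, and time is passed in explicitly so the behaviour is easy to reason about. The idea is that once enough read errors land inside a short window, fetchers stop blaming upstream attempts and simply retry later.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch only; not part of the Tez API. */
final class FetchFailureCircuitBreaker {
    private final int failureThreshold; // failures inside the window that trip the breaker (assumed knob)
    private final long windowMillis;    // length of the sliding window (assumed knob)
    private final long backoffMillis;   // how long the breaker stays open once tripped (assumed knob)
    private final Deque<Long> recentFailures = new ArrayDeque<>();
    private long openUntilMillis = Long.MIN_VALUE; // breaker starts closed

    FetchFailureCircuitBreaker(int failureThreshold, long windowMillis, long backoffMillis) {
        this.failureThreshold = failureThreshold;
        this.windowMillis = windowMillis;
        this.backoffMillis = backoffMillis;
    }

    /** Record one shuffle read error; trips the breaker if the window is now full. */
    synchronized void recordFailure(long nowMillis) {
        recentFailures.addLast(nowMillis);
        evictOlderThanWindow(nowMillis);
        if (recentFailures.size() >= failureThreshold) {
            openUntilMillis = nowMillis + backoffMillis;
        }
    }

    /** True while the breaker is open, i.e. failures look like a cluster-wide blip. */
    synchronized boolean isOpen(long nowMillis) {
        return nowMillis < openUntilMillis;
    }

    /** Only blame the source attempt (and risk re-running it) when the breaker is closed. */
    synchronized boolean shouldBlameSource(long nowMillis) {
        return !isOpen(nowMillis);
    }

    /** Drop failure timestamps that have fallen out of the sliding window. */
    private void evictOlderThanWindow(long nowMillis) {
        while (!recentFailures.isEmpty()
                && nowMillis - recentFailures.peekFirst() > windowMillis) {
            recentFailures.removeFirst();
        }
    }
}
```

With, say, a threshold of 3 failures in a 10-second window and a 30-second back-off, three read errors within two seconds would open the breaker, and `shouldBlameSource` would return false until the back-off expires. Isolated failures spread further apart than the window would never trip it, so genuine single-node failures would still be reported as before.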

          People

            Assignee: Unassigned
            Reporter: Gopal Vijayaraghavan (gopalv)