Uploaded image for project: 'Apache Tez'
  1. Apache Tez
  2. TEZ-3718

Better handling of 'bad' nodes

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Patch Available
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      At the moment, the default behaviour in case of a node being marked bad is to do nothing other than not schedule new tasks on this node.
      The alternate, via config, is to retroactively kill every task which ran on the node, which causes far too many unnecessary re-runs.

      Proposing the following changes.
      1. KILL fragments which are currently in the RUNNING state (instead of relying on a timeout which leads to the attempt being marked as FAILED after the timeout interval.
      2. Keep track of these failed nodes, and use this as input to the failure heuristics. Normally source tasks require multiple consumers to report failure for them to be marked as bad. If a single consumer reports failure against a source which ran on a bad node, consider it bad and re-schedule immediately. (Otherwise failures can take a while to propagate, and jobs get a lot slower).

      jlowe - think you've looked at this in the past. Any thoughts/suggestions.
      What I'm seeing is retroactive failures taking a long time to apply, and restart sources which ran on a bad node. Also running tasks being counted as FAILURES instead of KILLS.

      Attachments

        1. TEZ-3718.1.patch
          27 kB
          Zhiyuan Yang
        2. TEZ-3718.2.patch
          54 kB
          Zhiyuan Yang
        3. TEZ-3718.3.patch
          38 kB
          Zhiyuan Yang
        4. TEZ-3718.4.patch
          33 kB
          Zhiyuan Yang

        Activity

          People

            zhiyuany Zhiyuan Yang
            sseth Siddharth Seth
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated: