Apache Tez / TEZ-3075

Revamp bad node handling


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      The current logic for handling bad nodes is derived from MapReduce (MR) and does not work in all cases.
      Things to consider:
      1) Introduce a notion of probation, where a machine is taken out of service for a period of time (say 5m, 15m and 30m) before being given up on for good. This allows more graceful handling of temporary glitches.
      2) Handle a node that YARN marks as bad differently from one flagged by internal heuristics.
      3) Bad nodes should not immediately trigger re-execution of completed work. Re-execution should instead depend on the presence of downstream consumers (i.e. existing demand for that output) and on a reasonable indication from other consumers that the node cannot serve results (e.g. multiple reports of read errors with that node as the source).
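
      To make points 1) and 3) concrete, here is a minimal sketch of how the bookkeeping might look. It is not existing Tez code: the class BadNodeTracker, its method names, and the threshold of two distinct reporters are all hypothetical, and reading the 5m/15m/30m values as an escalating sequence (one step per repeated failure) is an assumption. Point 2) is left out because the description does not say how the two cases should differ.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of probation-based bad node handling; not part of Tez. */
final class BadNodeTracker {
    /** Probation periods from the description, assumed to escalate per strike. */
    private static final List<Duration> PROBATION_STEPS =
        List.of(Duration.ofMinutes(5), Duration.ofMinutes(15), Duration.ofMinutes(30));

    /** Assumed threshold: distinct consumers that must report read errors. */
    private static final int MIN_READ_ERROR_REPORTERS = 2;

    private static final class NodeState {
        int strikes;               // how many times the node has been marked bad
        long outOfServiceUntilMs;  // Long.MAX_VALUE once the node is given up for good
        final Set<String> readErrorReporters = new HashSet<>();
    }

    private final Map<String, NodeState> nodes = new HashMap<>();

    /** Puts the node on the next probation step, or gives up once all steps are used. */
    void markBad(String nodeId, long nowMs) {
        NodeState s = nodes.computeIfAbsent(nodeId, k -> new NodeState());
        s.strikes++;
        if (s.strikes > PROBATION_STEPS.size()) {
            s.outOfServiceUntilMs = Long.MAX_VALUE;
        } else {
            s.outOfServiceUntilMs = nowMs + PROBATION_STEPS.get(s.strikes - 1).toMillis();
        }
    }

    /** A node is usable if it was never marked bad or its probation has expired. */
    boolean isUsable(String nodeId, long nowMs) {
        NodeState s = nodes.get(nodeId);
        return s == null || nowMs >= s.outOfServiceUntilMs;
    }

    boolean isBlacklisted(String nodeId) {
        NodeState s = nodes.get(nodeId);
        return s != null && s.outOfServiceUntilMs == Long.MAX_VALUE;
    }

    /** Records that a consumer failed to read output hosted on sourceNodeId. */
    void reportReadError(String sourceNodeId, String consumerId) {
        nodes.computeIfAbsent(sourceNodeId, k -> new NodeState())
             .readErrorReporters.add(consumerId);
    }

    /**
     * Completed work on a node is re-executed only if some consumer still needs
     * the output and enough distinct consumers reported read errors against it.
     */
    boolean shouldReExecute(String sourceNodeId, int pendingConsumers) {
        NodeState s = nodes.get(sourceNodeId);
        return s != null
            && pendingConsumers > 0
            && s.readErrorReporters.size() >= MIN_READ_ERROR_REPORTERS;
    }
}
```

      In this sketch, marking a node bad and deciding to re-run its completed work are deliberately separate calls, so a node can sit in probation without any of its finished output being recomputed unless consumers actually fail to fetch it.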

          People

            Chyler Ying Han
            bikassaha Bikas Saha