Apache Tez / TEZ-3462

Task attempt failure during container shutdown loses useful container diagnostics

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.1
    • Fix Version/s: 0.9.0, 0.8.5
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      When a NodeManager kills a task attempt due to excessive memory usage, it sends a SIGTERM followed by a SIGKILL. It also sends a useful diagnostic message with the container completion event to the RM, and that message eventually reaches the AM on a subsequent heartbeat.

      However, if JVM shutdown processing causes an error in the task (e.g., the filesystem being closed by a shutdown hook), the task attempt can report a failure before the useful NM diagnostic reaches the AM. The AM then records some other error as the task failure reason. By the time the container completion status reaches the AM, the AM no longer associates that diagnostic with the task attempt, and the useful information is lost.
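
      The ordering problem described above can be sketched with a small model of AM-side bookkeeping. This is only an illustration under stated assumptions: the class AttemptDiagnostics, its methods, and the keepLateContainerStatus flag are hypothetical names invented for this sketch, not Tez classes, and the "keep" branch shows one conceivable mitigation rather than the change made in the attached patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of AM-side bookkeeping; none of these names come from Tez.
class AttemptDiagnostics {
    private final Map<String, List<String>> diagnostics = new HashMap<>();
    // Assumption for illustration: a switch between "drop" and "keep" behaviour.
    private final boolean keepLateContainerStatus;

    AttemptDiagnostics(boolean keepLateContainerStatus) {
        this.keepLateContainerStatus = keepLateContainerStatus;
    }

    // The task attempt itself reports a failure, e.g. an error raised
    // because a shutdown hook already closed the filesystem.
    void onTaskReportedFailure(String attemptId, String reason) {
        diagnostics.computeIfAbsent(attemptId, k -> new ArrayList<>()).add(reason);
    }

    // The container completion status arrives later, on a subsequent heartbeat.
    void onContainerCompleted(String attemptId, String nmDiagnostic) {
        boolean alreadyFailed = diagnostics.containsKey(attemptId);
        if (alreadyFailed && !keepLateContainerStatus) {
            // Symptom from the description: the attempt already has a failure
            // reason, so the late NM diagnostic is not associated with it.
            return;
        }
        diagnostics.computeIfAbsent(attemptId, k -> new ArrayList<>()).add(nmDiagnostic);
    }

    // Returns every diagnostic recorded for the attempt, in arrival order.
    List<String> diagnosticsFor(String attemptId) {
        return diagnostics.getOrDefault(attemptId, Collections.emptyList());
    }
}
```

      With keepLateContainerStatus set to false, a task-reported failure followed by a container completion leaves only the task's own error on record, which mirrors the lost diagnostic. With it set to true, both messages are retained in arrival order.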

        Attachments

        1. TEZ-3462.001.patch
          5 kB
          Eric Badger

              People

              • Assignee: Eric Badger (ebadger)
              • Reporter: Jason Darrell Lowe (jlowe)
              • Votes: 0
              • Watchers: 6
