Apache Tez / TEZ-676

Tez job fails on client side if nodemanager running AM is lost

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: None
    • Labels: None

    Description

      Scenario:

      1) Run a long-running Teragen job.
      2) Find the node where the AM has started.
      3) Kill the nodemanager (NM) on the AM host using kill -9.

      Expected:
      A 2nd AM should be started and the job should resume. The job should also keep running on the client side.

      Actual:
      The 1st AM was started, and then the NM running it was killed. The job waited around 10 minutes before the 2nd AM attempt was started. At that same moment, the client-side job output reported "job failed" and the client exited, even though the RM had already started the 2nd AM. The 2nd AM then ran and the job eventually finished successfully.
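
      The expected client behavior can be illustrated with a minimal sketch. It is not the code from the attached patches, and every name in it (AppState, AMUnreachable, poll_job, query_am, query_rm) is hypothetical. It assumes the client can ask either the current AM attempt or the RM for status, and that losing one AM attempt should not by itself be reported as a job failure.

```python
from enum import Enum


class AppState(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Only application-level terminal states should end client-side monitoring.
TERMINAL = {AppState.SUCCEEDED, AppState.FAILED}


class AMUnreachable(Exception):
    """Hypothetical error raised when the current AM attempt cannot be reached."""


def poll_job(query_am, query_rm, max_polls):
    """Poll until the application (not just one AM attempt) is terminal.

    query_am: callable returning an AppState, or raising AMUnreachable.
    query_rm: callable returning the application-level AppState from the RM.
    Returns the terminal state, or RUNNING if max_polls is exhausted.
    """
    for _ in range(max_polls):
        try:
            state = query_am()
        except AMUnreachable:
            # A lost AM attempt is not a job failure: fall back to the RM,
            # which may still be scheduling or running a new attempt.
            state = query_rm()
        if state in TERMINAL:
            return state
    return AppState.RUNNING
```

      With this shape, the client in the scenario above would keep polling through the window in which the 1st AM is gone, and would report failure only if the RM itself marked the application as failed.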

      Attachments

        1. TEZ-676.1.patch
          8 kB
          Hitesh Shah
        2. TEZ-676.2.patch
          8 kB
          Hitesh Shah
        3. TEZ-676.3.patch
          9 kB
          Hitesh Shah
        4. TEZ-676.4.patch
          12 kB
          Hitesh Shah

            People

              Assignee: Hitesh Shah (hitesh)
              Reporter: Yesha Vora (yeshavora)
              Votes: 0
              Watchers: 4