SAMZA-867

Fix job restart/shutdown in the event of a node outage.

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed

    Description

      A number of jobs failed or restarted when we lost a couple of hosts in the cluster.
      The theory is that the AppMaster detects the failed container before YARN
      detects the missing NodeManager (NM), so it tries to run the container on that
      host again but doesn't handle the resulting NM connection errors properly.
      Switching from a synchronous NM client model to an asynchronous one is expected
      to help, but we need to discuss this.
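
      To make the proposal concrete, here is a minimal sketch of the asynchronous idea. It is not the attached patch and does not use the YARN client API: NodeManagerClient, LaunchResult, and launchAsync are hypothetical names invented for illustration, and the host names are made up. The sketch only shows the intended behavior, where a connection error from an unreachable NM is recorded as a failed start and the next candidate host is tried, instead of the error escaping into the AppMaster.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class AsyncContainerStartSketch {

    /** Hypothetical stand-in for a NodeManager client; NOT the YARN API. */
    public interface NodeManagerClient {
        void startContainer(String host) throws IOException;
    }

    /** Outcome of a launch: the host that accepted the container, plus hosts that failed. */
    public static final class LaunchResult {
        public final String startedOn;
        public final List<String> failedHosts;

        LaunchResult(String startedOn, List<String> failedHosts) {
            this.startedOn = startedOn;
            this.failedHosts = failedHosts;
        }
    }

    /**
     * Tries each candidate host off the caller's thread. A connection error is
     * treated as a failed start on that host, and the next host is tried.
     * The future completes exceptionally only if every host fails.
     */
    public static CompletableFuture<LaunchResult> launchAsync(
            NodeManagerClient nm, List<String> candidateHosts) {
        return CompletableFuture.supplyAsync(() -> {
            List<String> failed = new ArrayList<>();
            for (String host : candidateHosts) {
                try {
                    nm.startContainer(host);
                    return new LaunchResult(host, failed);
                } catch (IOException e) {
                    // The NM is unreachable (for example, the node is down):
                    // remember the failure and fall through to the next host.
                    failed.add(host);
                }
            }
            throw new IllegalStateException("No reachable host among " + candidateHosts);
        });
    }

    public static void main(String[] args) {
        // Simulated cluster (assumption): "dead-host" refuses connections.
        NodeManagerClient nm = host -> {
            if (host.equals("dead-host")) {
                throw new IOException("Connection refused: " + host);
            }
        };
        LaunchResult result =
                launchAsync(nm, Arrays.asList("dead-host", "live-host")).join();
        System.out.println(result.startedOn + " " + result.failedHosts);
    }
}
```

      In a real AppMaster the failed hosts would more likely feed back into a new container request to the ResourceManager than into a fixed candidate list; the list here just keeps the sketch self-contained.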

      Attachments

        1. SAMZA-867_4.patch
          66 kB
          Jake Maes
        2. SAMZA-867_3.patch
          66 kB
          Jake Maes
        3. SAMZA-867_2.patch
          51 kB
          Jake Maes
        4. SAMZA-867.patch
          50 kB
          Jake Maes

            People

              Assignee: jmakes Jake Maes
              Reporter: jmakes Jake Maes
              Votes: 0
              Watchers: 3
