  1. Samza
  2. SAMZA-1508

JobRunner should not return success until the job is healthy

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.15.0
    • Component/s: None
    • Labels: None

    Description

      It can be frustrating for users when run-app.sh returns success before the job is fully running.

      This happens because the JobRunner currently waits for JobStatus=RUNNING, but in YARN, for example, the job reaches that status when the AM is launched, not when all the containers are launched.
      What can go wrong?
      1. The job could stay stuck waiting for containers that it can't get because of capacity issues or an outage.
      2. The job containers may immediately fail due to a runtime error.

      In both cases, the user may go on their merry way because run-app.sh returned successfully, even though the job is already dead. They may not get alerted for some time.

      How do we fix it?
      There are a few ways to fix it, each progressively harder but progressively better:
      1. Make the JobRunner reach out to the AM and monitor the needed-containers metric until it reaches 0.
      2. Expose a new "healthy" endpoint in the AM that is set to true only once a heartbeat has been received from each of the containers. Have the JobRunner wait on this (with a timeout).
      3. Expose a hook where users can write custom logic to determine job health

      I think #1 is the most bang for buck and the implementation for #1 can easily be extended for #2 later.
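
      A minimal sketch of what option #1 could look like, assuming the JobRunner has some way to read the AM's needed-containers count. The class name JobHealthWaiter, the Result enum, and the IntSupplier standing in for the metric lookup are all hypothetical and not part of Samza's actual API; how the metric is fetched from the AM is deliberately left out. A timeout is included so the runner fails instead of waiting forever.

```java
import java.time.Duration;
import java.util.function.IntSupplier;

/** Hypothetical helper, not part of Samza: waits for the needed-containers metric to reach 0. */
final class JobHealthWaiter {
    /** Outcome of the wait. */
    enum Result { HEALTHY, TIMED_OUT, INTERRUPTED }

    private JobHealthWaiter() {}

    /**
     * Polls the supplied metric until it reports 0 (or less) or the timeout elapses.
     *
     * @param neededContainers assumed to return the AM's current needed-containers count
     * @param timeout          maximum total time to wait
     * @param pollInterval     delay between polls
     */
    static Result waitUntilHealthy(IntSupplier neededContainers,
                                   Duration timeout,
                                   Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (neededContainers.getAsInt() <= 0) {
                return Result.HEALTHY;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return Result.TIMED_OUT;
            }
            // Sleep for the poll interval, but never past the deadline.
            long sleepMillis = Math.min(pollInterval.toMillis(),
                                        Math.max(1L, remainingNanos / 1_000_000L));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag for the caller
                return Result.INTERRUPTED;
            }
        }
    }
}
```

      With something like this, run-app.sh could exit non-zero on TIMED_OUT, which covers the case where the job is stuck waiting for containers. Swapping the supplier for a check against a "healthy" endpoint would be one way to extend it toward #2.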

      Other notes:
      I don't think this is needed for standalone, since users are directly deploying the processors and can monitor the processes directly.

            People

              Assignee: Jake Maes (jmakes)
              Reporter: Jake Maes (jmakes)