[SAMZA-1508] JobRunner should not return success until the job is healthy - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.15.0
Component/s: None
Labels: None

Description

It can be frustrating for users when run-app.sh returns success before the job is fully running.

This happens because the JobRunner currently waits for JobStatus=RUNNING, but in YARN, for example, that status is reported when the AM is launched, not when all the containers are launched.
What can go wrong?
1. The job could stay stuck waiting for containers that it can't get because of capacity issues or an outage.
2. The job containers may immediately fail due to a runtime error.

In both cases, the user may go on their merry way because run-app.sh returned successfully, even though the job is already dead. They may not get alerted for some time.

How do we fix it?
There are a few ways to fix it, each progressively harder but progressively better:
1. Make JobRunner reach out to the AM and monitor the needed containers metric until it reaches 0.
2. Expose a new healthy endpoint in the AM that is set to true only once a heartbeat has been received from each of the containers. Have the JobRunner wait on this (with a timeout).
3. Expose a hook where users can write custom logic to determine job health.

I think #1 is the most bang for buck and the implementation for #1 can easily be extended for #2 later.
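
As a rough illustration of option 1, the wait could look something like the sketch below. This is not the actual JobRunner code: the class name JobHealthWaiter, the method waitUntilHealthy, and the neededContainers supplier are all hypothetical, and how the needed containers metric is actually fetched from the AM is left to the caller.

```java
import java.util.function.IntSupplier;

// Hypothetical helper, not part of Samza: polls a "needed containers" reading
// until it reaches 0 or a timeout expires.
final class JobHealthWaiter {

    private JobHealthWaiter() {
    }

    /**
     * @param neededContainers assumed to return the AM's current needed-containers count
     * @param timeoutMs        maximum time to wait, in milliseconds
     * @param pollIntervalMs   delay between polls, in milliseconds
     * @return true if the count reached 0 before the timeout, false otherwise
     */
    static boolean waitUntilHealthy(IntSupplier neededContainers,
                                    long timeoutMs,
                                    long pollIntervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            // Healthy once no more containers are outstanding.
            if (neededContainers.getAsInt() == 0) {
                return true;
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false; // timed out: the caller should report failure
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.max(1L, Math.min(pollIntervalMs, remainingMs)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // treat interruption as "not healthy"
            }
        }
    }
}
```

Because the health signal is just a supplier, the same loop could later read the result of a healthy endpoint (option 2) by mapping "healthy" to 0 and "not healthy" to a nonzero value, without changing the wait logic.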

Other notes:
I don't think this is needed for standalone, since users are directly deploying the processors and can monitor the processes directly.

Issue Links

links to

GitHub Pull Request #367

People

Assignee: Jake Maes

Reporter: Jake Maes

Dates

Created: 23/Nov/17 00:29

Updated: 18/May/18 21:28

Resolved: 18/May/18 21:28