We have a metric that tracks the number of running containers, but nothing that indicates whether the job is healthy. A job is healthy when the AM and all of its containers are running.
Expose a "healthy" metric: it should be 1 if the AM and all containers are running, and 0 otherwise.
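
A minimal sketch of the proposed rule, written in Python only to pin down the logic. The function name `job_health` and its parameters are hypothetical and not taken from the existing code base. It assumes the AM already knows how many containers the job needs and how many are currently running (the latter is what the existing container-count metric reports).

```python
def job_health(am_running: bool, running_containers: int, needed_containers: int) -> int:
    """Return 1 if the AM and every needed container are running, 0 otherwise.

    All names here are illustrative assumptions, not an existing API.
    """
    # Assumption: at least as many running containers as needed counts as
    # "all containers running"; use == instead if surplus containers
    # should be reported as unhealthy.
    all_containers_up = running_containers >= needed_containers
    return 1 if am_running and all_containers_up else 0
```

For example, `job_health(True, 3, 4)` evaluates to 0 because one container is missing, and `job_health(False, 4, 4)` evaluates to 0 because the AM is down. Reporting the value as a 0/1 gauge rather than a boolean keeps it easy to graph and alert on alongside the existing container-count metric.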