Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Description
HiveServer2 optionally maintains a pool of AMs in either Tez or LLAP mode. This is done to amortize the cost of launching a Tez session.
We also try to kill all of these AMs in a shutdown hook when HS2 goes down. However, there are cases where HS2 doesn't get the chance to kill them before it goes away. As a result, these zombie AMs hang around until the timeout kicks in.
The trouble with the timeout is that we have to set it fairly high. Otherwise, the benefit of having pre-launched AMs goes away (in a lightly loaded cluster).
So, if people kill or restart HS2, they often run into situations where the cluster or queue has no capacity left for new AMs. They then have to either kill the zombies manually or wait for them to time out.
The request is therefore for Tez to maintain a heartbeat to the client. If the client goes away, the AM should exit. That way we can keep the AMs alive for a long time regardless of activity, and at the same time we don't have to worry about them if HS2 goes down.
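To make the request concrete, here is a minimal sketch of what an AM-side check could look like. It is not Tez's actual implementation or API: the class `ClientHeartbeatMonitor`, its methods, and the timeout parameter are all hypothetical names chosen for illustration. The idea is simply that the AM records the time of the last client heartbeat and treats the client as gone once no heartbeat has arrived within a configured window.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical AM-side watchdog (illustrative only, not a Tez class).
 * The client (e.g. HS2) is assumed to call recordHeartbeat() periodically,
 * for instance from an RPC handler.
 */
final class ClientHeartbeatMonitor {
    private final long timeoutMillis;
    private final LongSupplier clock;          // injected so the logic can be tested
    private final AtomicLong lastHeartbeatMillis;

    ClientHeartbeatMonitor(long timeoutMillis, LongSupplier clock) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        // Treat construction time as the first heartbeat.
        this.lastHeartbeatMillis = new AtomicLong(clock.getAsLong());
    }

    /** Called whenever a heartbeat from the client arrives. */
    void recordHeartbeat() {
        lastHeartbeatMillis.set(clock.getAsLong());
    }

    /** True if no heartbeat has been seen for longer than the timeout. */
    boolean isClientExpired() {
        return clock.getAsLong() - lastHeartbeatMillis.get() > timeoutMillis;
    }

    /**
     * Runs the given action (e.g. initiating AM shutdown) if the client
     * has expired. Returns true if the action was run.
     */
    boolean checkOnce(Runnable onExpired) {
        if (isClientExpired()) {
            onExpired.run();
            return true;
        }
        return false;
    }
}
```

In an AM, `checkOnce` would presumably be invoked on a fixed schedule (for example from a `ScheduledExecutorService`) with `System::currentTimeMillis` as the clock and a callback that starts an orderly shutdown. With such a check in place, the heartbeat window can be short while the idle session timeout stays long, which is the separation this issue asks for.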