Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Description
GPU resources are discovered when the NM bootstraps, but they are not updated through later heartbeats with the RM. There should be a monitoring mechanism that periodically checks GPU health status, along with the corresponding handling when that status changes.
YARN-8851 will also handle device monitoring, so there may be some common parts between the two.
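To make the requested behaviour concrete, here is a minimal sketch of a periodic health check. It is written in Python only for brevity and is not YARN code: `GpuHealthMonitor`, `probe`, `on_change`, and `run_periodically` are hypothetical names, and treating a failed probe as "no healthy GPUs" is an assumption rather than a decided policy. The idea is that the NM keeps the last known set of healthy GPUs, re-probes on a timer, and only reports to the RM (for example in a later heartbeat) when the set changes.

```python
import threading
from typing import Callable, Iterable, Set


class GpuHealthMonitor:
    """Tracks the set of healthy GPUs and reports changes (hypothetical sketch)."""

    def __init__(
        self,
        discovered: Iterable[int],
        probe: Callable[[], Iterable[int]],
        on_change: Callable[[Set[int]], None],
    ) -> None:
        # GPUs found at bootstrap are assumed healthy until a probe says otherwise.
        self.healthy: Set[int] = set(discovered)
        self.probe = probe          # returns indices of GPUs that are currently healthy
        self.on_change = on_change  # e.g. queue an updated resource report for the RM

    def check_once(self) -> bool:
        """Probe once; return True if the healthy set changed and was reported."""
        try:
            current = set(self.probe())
        except Exception:
            # Assumption: if the probe itself fails, treat every GPU as unhealthy.
            current = set()
        if current == self.healthy:
            return False
        self.healthy = current
        self.on_change(set(current))
        return True


def run_periodically(
    monitor: GpuHealthMonitor, interval_seconds: float, stop: threading.Event
) -> None:
    """Call check_once every interval_seconds until stop is set."""
    # Event.wait returns False on timeout, so the loop body runs once per interval.
    while not stop.wait(interval_seconds):
        monitor.check_once()
```

In this sketch the probe is injected, so the same loop could in principle be shared with other device types, which is where overlap with YARN-8851 might appear.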