Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
I observed this issue twice when flexing down an FGS NM, and once more when a NullPointerException occurred (JIRA Myriad-135).
In the Mesos UI, no resources were left to offer, even though nothing was running in the cluster except 3 NMs (2 MP, 1 ZP).
While debugging the RM logs, I found the NullPointerException, which caused the OfferEventHandler thread to exit; after that, no more offers went from Mesos to Myriad.
I then restarted the RM, and the resources were returned to Mesos.
Next, I ran a few MapReduce jobs and hit the issue again when flexing down an FGS NM: all of the resources offered to Myriad became blocked, and Myriad did not release any of them afterwards.
So it seems that the flex-down procedure for NMs only cleans up the active containers and the NM itself, but does not clean up outstanding offers when those offers have been saved to OfferLifeCycle for future tasks by FGS NMs.
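The missing cleanup step described above could look roughly like the following. This is only a minimal sketch, not Myriad's actual code: the class `HeldOffers` and its methods are hypothetical names introduced for illustration, and offers are represented by plain string IDs. The idea is that flex down should also drop every offer held for the host and hand the IDs back to the caller, which would then decline them so Mesos can re-offer the resources.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the per-host store of offers saved for FGS NMs.
final class HeldOffers {
    // hostname -> IDs of offers that were saved for future tasks
    private final Map<String, List<String>> offersByHost = new HashMap<>();

    // Remember an offer that is being held for a host.
    void hold(String host, String offerId) {
        offersByHost.computeIfAbsent(host, h -> new ArrayList<>()).add(offerId);
    }

    // Intended to run as part of flex down: forget every offer held for the
    // host and return the IDs so the caller can decline them back to Mesos.
    List<String> releaseForHost(String host) {
        List<String> released = offersByHost.remove(host);
        return released == null ? new ArrayList<>() : released;
    }

    // Number of offers still held across all hosts.
    int heldCount() {
        int count = 0;
        for (List<String> ids : offersByHost.values()) {
            count += ids.size();
        }
        return count;
    }
}
```

With something like this in place, the flex-down path would call `releaseForHost(host)` after stopping the NM and decline each returned offer, instead of leaving them outstanding.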
Resources (from mesos-master UI)
=========
          CPUs                     Mem
Total     84                       253.9 GB
Used      3.300                    6.1 GB
Offered   80.700                   247.8 GB
Idle      1.4210854715202004e-14   0 B        <------ No resources available.
Here are the active (blocked) offers shown in the Mesos UI:
Offers
=====
ID                 Framework     Host          CPUs    Mem
…5050-3270-O4151   MyriadAlpha   node101-116   0.5     64 MB
…5050-3270-O4149   MyriadAlpha   node101-116   0.200   282 MB
…5050-3270-O4147   MyriadAlpha   node101-116   1       1.0 GB
…5050-3270-O4145   MyriadAlpha   node101-116   1       1.0 GB
…5050-3270-O4143   MyriadAlpha   node101-116   1       1.0 GB
…5050-3270-O4141   MyriadAlpha   node101-116   1       1.0 GB
…5050-3270-O4139   MyriadAlpha   node101-117   24.5    87.8 GB
…5050-3270-O4137   MyriadAlpha   node101-116   22.9    87.4 GB
…5050-3270-O4135   MyriadAlpha   node101-117   3       3.0 GB
…5050-3270-O4134   MyriadAlpha   node101-137   25.6    65.2 GB
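As a quick cross-check of the two tables, the CPU values of the listed offers can be summed and compared with the "Offered" row. The sketch below does that arithmetic, assuming that idle is computed as Total - Used - Offered (an assumption about the UI, not something the tables state). The class and method names are hypothetical. The sum comes to about 80.7 CPUs, matching the Offered row, so the blocked offers account for everything that is not in use; the tiny non-zero Idle value is consistent with floating-point rounding of a result that is effectively zero.

```java
// Hypothetical helper for cross-checking the tables above.
final class OfferTotals {
    // CPU values copied from the Offers table, in the order listed.
    static final double[] OFFERED_CPUS = {0.5, 0.2, 1, 1, 1, 1, 24.5, 22.9, 3, 25.6};

    // Sum of all values in the array.
    static double sum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    // Assumed relation: idle = total - used - offered.
    static double idleCpus(double total, double used, double offered) {
        return total - used - offered;
    }
}
```

Because the values are doubles, any comparison against 80.7 or 0 should use a small tolerance rather than exact equality.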