[MYRIAD-133] Multiple flexed up NMs try to run on same node, altogether. - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: Myriad 0.1.0
Component/s: Scheduler
Labels: None

Description

On a 3-node cluster running the latest build with the NM + Executor merge, I am
seeing an issue when flexing up multiple NM instances: several NMs try to start
on the same node at the same time.

Here are the tasks already running under Myriad (before the multiple-NM flex
up):

[root@qa101-137 ~]# curl -s http://testrm.marathon.mesos:8192/api/state

{"pendingTasks":[], "stagingTasks":[], "activeTasks":[ "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89", "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52"], "killableTasks":[]}
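
To make the starting point easier to read, here is a minimal Python sketch that groups the tasks in that response by state and profile. It assumes task IDs follow the "nm.<profile>.<uuid>" pattern seen in the output above; `profile_of` and `summarize` are hypothetical helper names, not part of Myriad.

```python
import json
from collections import Counter

# Response body copied from the /api/state call above.
STATE_JSON = (
    '{"pendingTasks":[], "stagingTasks":[], "activeTasks":[ '
    '"nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89", '
    '"nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52"], "killableTasks":[]}'
)


def profile_of(task_id):
    # Assumption: task IDs look like "nm.<profile>.<uuid>".
    return task_id.split(".")[1]


def summarize(state):
    """Return {state_name: Counter(profile -> count)} for each task list."""
    return {
        name: Counter(profile_of(task) for task in tasks)
        for name, tasks in state.items()
    }


summary = summarize(json.loads(STATE_JSON))
for name, counts in sorted(summary.items()):
    print(name, dict(counts))
```

Before the flex up, this shows one medium and one small NM under activeTasks and nothing in the other lists.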

Then I tried flexing up 4 instances of the zero-profile NM. Note that only 1
node is without an NM; the other 2 nodes are already running NMs (see above).

Here is the flex-up request, followed by the task status from Myriad just after
the flex up and again once all NMs were in the active state.

[root@qa101-137 ~]# curl -H "Content-Type: application/json" -X PUT -d '{"instances":4, "profile":"zero"}' http://testrm.marathon.mesos:8192/api/cluster/flexup
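
For anyone scripting the reproduction, the same request can be sketched in Python with the standard library. The URL, HTTP method, header, and payload are taken from the curl command above; `build_flexup_request` is a hypothetical helper name, and the sketch only builds the request without sending it.

```python
import json
import urllib.request

# Endpoint taken from the curl command above.
FLEXUP_URL = "http://testrm.marathon.mesos:8192/api/cluster/flexup"


def build_flexup_request(instances, profile, url=FLEXUP_URL):
    """Build (but do not send) the PUT request for a flex up."""
    body = json.dumps({"instances": instances, "profile": profile}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_flexup_request(4, "zero")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually send it against a live cluster: urllib.request.urlopen(request)
```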

[root@qa101-137 ~]# curl -s http://testrm.marathon.mesos:8192/api/state | python -mjson.tool
{
    "activeTasks": [
        "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89",
        "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52"
    ],
    "killableTasks": [],
    "pendingTasks": [
        "nm.zero.cd35db39-30f0-4da5-aa07-67c22cfe40ee",
        "nm.zero.ad7d597c-27f8-4e2c-8108-ae675990fdd9",
        "nm.zero.5110931a-279e-4f95-b4e6-5d1167d45993"
    ],
    "stagingTasks": [
        "nm.zero.a5e73358-351f-4938-ba3d-9dc759b514e0"
    ]
}

[root@qa101-137 ~]# curl -s http://testrm.marathon.mesos:8192/api/state | python -mjson.tool
{
    "activeTasks": [
        "nm.zero.a5e73358-351f-4938-ba3d-9dc759b514e0",
        "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89",
        "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52",
        "nm.zero.cd35db39-30f0-4da5-aa07-67c22cfe40ee",
        "nm.zero.ad7d597c-27f8-4e2c-8108-ae675990fdd9",
        "nm.zero.5110931a-279e-4f95-b4e6-5d1167d45993"
    ],
    "killableTasks": [],
    "pendingTasks": [],
    "stagingTasks": []
}

On Mesos, all 4 NMs tried to start on a single node. They were all in the
RUNNING state at some point, and then moved to the LOST state once the NMs had
settled down. Later, Myriad also moved the unsuccessful tasks from the active
state back to the pending state.
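
The behavior I would expect instead can be sketched as follows, assuming the intended rule is at most one NM per node (the report implies this but does not state it). This is an illustration only, not Myriad's scheduler code: `place_node_managers`, the host names, and the shortened task names are all made up.

```python
def place_node_managers(pending, offer_hosts, hosts_with_nm):
    """Assign at most one pending NM to each offered host that has no NM yet.

    Returns (placements, still_pending). Hypothetical helper, not Myriad code.
    """
    occupied = set(hosts_with_nm)
    placements = {}
    remaining = list(pending)
    for host in offer_hosts:
        if not remaining:
            break
        if host in occupied:
            continue  # skip hosts that already run (or were just given) an NM
        placements[remaining.pop(0)] = host
        occupied.add(host)
    return placements, remaining


# Scenario from the report: 3 nodes, 2 already running an NM,
# 4 zero-profile NMs requested. Names below are made up for illustration.
pending = ["nm.zero.1", "nm.zero.2", "nm.zero.3", "nm.zero.4"]
placements, still_pending = place_node_managers(
    pending,
    offer_hosts=["node3", "node3", "node3", "node3"],  # repeated offers from the free node
    hosts_with_nm={"node1", "node2"},
)
print(placements)     # one NM placed on node3
print(still_pending)  # the other three stay pending
```

Under that assumption, only one zero-profile NM would ever be launched on the free node, and the other three would wait in pending without first going through RUNNING and LOST.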

[root@qa101-137 ~]# curl -s http://testrm.marathon.mesos:8192/api/state | python -mjson.tool
{
    "activeTasks": [
        "nm.zero.a5e73358-351f-4938-ba3d-9dc759b514e0",
        "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89",
        "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52"
    ],
    "killableTasks": [],
    "pendingTasks": [
        "nm.zero.cd35db39-30f0-4da5-aa07-67c22cfe40ee",
        "nm.zero.ad7d597c-27f8-4e2c-8108-ae675990fdd9",
        "nm.zero.5110931a-279e-4f95-b4e6-5d1167d45993"
    ],
    "stagingTasks": []
}
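
The move from active back to pending can be confirmed by comparing the last two snapshots. Here is a small Python sketch, with the task lists copied from the two outputs above; `moved` is a hypothetical helper name.

```python
# Task lists copied from the last two /api/state snapshots above.
before = {
    "activeTasks": [
        "nm.zero.a5e73358-351f-4938-ba3d-9dc759b514e0",
        "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89",
        "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52",
        "nm.zero.cd35db39-30f0-4da5-aa07-67c22cfe40ee",
        "nm.zero.ad7d597c-27f8-4e2c-8108-ae675990fdd9",
        "nm.zero.5110931a-279e-4f95-b4e6-5d1167d45993",
    ],
    "pendingTasks": [],
}
after = {
    "activeTasks": [
        "nm.zero.a5e73358-351f-4938-ba3d-9dc759b514e0",
        "nm.medium.a8a36268-e365-4fd2-a87c-4c02ac2aeb89",
        "nm.small.30e0ce9c-f9da-49de-b927-ab8a58be6d52",
    ],
    "pendingTasks": [
        "nm.zero.cd35db39-30f0-4da5-aa07-67c22cfe40ee",
        "nm.zero.ad7d597c-27f8-4e2c-8108-ae675990fdd9",
        "nm.zero.5110931a-279e-4f95-b4e6-5d1167d45993",
    ],
}


def moved(before, after, src, dst):
    """Tasks listed under `src` in `before` and under `dst` in `after`."""
    return sorted(set(before[src]) & set(after[dst]))


fell_back = moved(before, after, "activeTasks", "pendingTasks")
for task in fell_back:
    print(task)
```

This lists the three zero-profile NMs that Myriad reported as active and later returned to pending; only nm.zero.a5e73358-351f-4938-ba3d-9dc759b514e0 stayed active.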

Let me know if you need any additional details regarding the issue.

People

Assignee: Swapnil Daingade

Reporter: Sarjeet Singh

Votes: 0

Watchers: 3

Dates

Created: 09/Sep/15 00:09

Updated: 17/Oct/15 20:45

Resolved: 17/Oct/15 20:45