Description
We observed the following master crash in production:
F0214 00:36:15.746939 3827787 master.cpp:11190] Check failed: 'framework' Must be non NULL
which corresponds to this check in the code: https://github.com/apache/mesos/blob/9635d4a2d12fc77935c3d5d166469258634c6b7e/src/master/master.cpp#L11203
Diagnosis
The checks were added in https://github.com/apache/mesos/commit/cf331184714f692f21988a53fd04fa64fbbb3aba (MESOS-8469):
Framework* framework = master->getFramework(event.task_added().task().framework_id());
CHECK_NOTNULL(framework);
However, at least when we recover tasks as an agent reregisters after a master failover, the frameworks that own those tasks may not have reregistered yet, so master->getFramework returns NULL for them. The checks did not account for this case, so the master aborts.
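To illustrate the failure mode, here is a minimal sketch of a lookup that tolerates a framework that has not reregistered yet. The Master, Framework and Task types and the handleTaskAdded function are simplified, hypothetical stand-ins, not the real definitions from master.cpp, and this is not necessarily how the actual fix was written.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical, simplified stand-ins for the Mesos types involved.
struct Framework {
  std::string id;
};

struct Task {
  std::string task_id;
  std::string framework_id;
};

class Master {
public:
  void addFramework(const std::string& id) { frameworks_[id] = Framework{id}; }

  // Returns nullptr when the framework is unknown to the master, e.g.
  // because it has not reregistered yet after a master failover.
  Framework* getFramework(const std::string& id) {
    auto it = frameworks_.find(id);
    return it == frameworks_.end() ? nullptr : &it->second;
  }

private:
  std::unordered_map<std::string, Framework> frameworks_;
};

// Handles a "task added" event. Returns true if the owning framework is
// known, false if it is not. An unknown framework is treated as a valid
// state here instead of aborting the process with CHECK_NOTNULL.
bool handleTaskAdded(Master& master, const Task& task) {
  Framework* framework = master.getFramework(task.framework_id);
  if (framework == nullptr) {
    // Assumption: the caller can proceed without framework details.
    return false;
  }
  return true;
}
```

With this shape, a task recovered from a reregistering agent whose framework is still missing makes handleTaskAdded return false rather than crashing the master.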
Issue Links
- duplicates
  - MESOS-8601 Master crashes during slave reregistration after failover. (Resolved)
- is broken by
  - MESOS-8469 Mesos master might drop some events in the operator API stream (Resolved)