We try to make Mesos work with multiple frameworks and mesos-dns at the same time. The goal is to have set of frameworks per team / project on a single Mesos cluster.
At this point our mesos state.json is at 4mb and it takes a while to assembly. 5 mesos-dns instances hit state.json every 5 seconds, effectively pushing mesos-master CPU usage through the roof. It's at 100%+ all the time.
Here's the problem:
Active tasks are just 6% of state.json response:
I see four options that can improve the situation:
1. Add query string param to exclude completed tasks from state.json and use it in mesos-dns and similar tools. There is no need for mesos-dns to know about completed tasks, it's just extra load on master and mesos-dns.
2. Make history size configurable.
3. Make JSON serialization faster. With 10000s of tasks even without history it would take a lot of time to serialize tasks for mesos-dns. Doing it every 60 seconds instead of every 5 seconds isn't really an option.
4. Create event bus for mesos master. Marathon has it and it'd be nice to have it in Mesos. This way mesos-dns could avoid polling master state and switch to listening for events.
All can be done independently.
Note to mesosphere folks: please start distributing debug symbols with your distribution. I was asking for it for a while and it is really helpful: https://github.com/mesosphere/marathon/issues/1497#issuecomment-104182501
Perf report for leading master:
I'm on 0.23.0.