Mesos / MESOS-3307

Configurable size of completed task / framework history

    Details

    • Sprint:
      Mesosphere Sprint 26
    • Story Points:
      3

      Description

      We are trying to run Mesos with multiple frameworks and mesos-dns at the same time. The goal is to have a set of frameworks per team / project on a single Mesos cluster.

      At this point our Mesos state.json is 4 MB and takes a while to assemble. Five mesos-dns instances hit state.json every 5 seconds, effectively pushing mesos-master CPU usage through the roof. It's at 100%+ all the time.

      Here's the problem:

      mesos λ curl -s http://mesos-master:5050/master/state.json | jq .frameworks[].completed_tasks[].framework_id | sort | uniq -c | sort -n
         1 "20150606-001827-252388362-5050-5982-0003"
        16 "20150606-001827-252388362-5050-5982-0005"
        18 "20150606-001827-252388362-5050-5982-0029"
        73 "20150606-001827-252388362-5050-5982-0007"
       141 "20150606-001827-252388362-5050-5982-0009"
       154 "20150820-154817-302720010-5050-15320-0000"
       289 "20150606-001827-252388362-5050-5982-0004"
       510 "20150606-001827-252388362-5050-5982-0012"
       666 "20150606-001827-252388362-5050-5982-0028"
       923 "20150116-002612-269165578-5050-32204-0003"
      1000 "20150606-001827-252388362-5050-5982-0001"
      1000 "20150606-001827-252388362-5050-5982-0006"
      1000 "20150606-001827-252388362-5050-5982-0010"
      1000 "20150606-001827-252388362-5050-5982-0011"
      1000 "20150606-001827-252388362-5050-5982-0027"
      
      mesos λ fgrep 1000 -r src/master
      src/master/constants.cpp:const size_t MAX_REMOVED_SLAVES = 100000;
      src/master/constants.cpp:const uint32_t MAX_COMPLETED_TASKS_PER_FRAMEWORK = 1000;
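
For anyone without jq at hand, the per-framework count above can be reproduced from a saved copy of state.json. This is only a sketch: the function name is made up for illustration, and it assumes the `frameworks[].completed_tasks[].framework_id` layout used by the jq filter above.

```python
import json
from collections import Counter


def completed_tasks_per_framework(state):
    """Count completed tasks per framework_id, like the jq pipeline."""
    counts = Counter()
    for framework in state.get("frameworks", []):
        for task in framework.get("completed_tasks", []):
            counts[task["framework_id"]] += 1
    # Ascending by count (ties broken by id), similar to `sort -n`.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))
```

Calling it on `json.load(open(path))` for a saved state file should make the frameworks that sit exactly at the `MAX_COMPLETED_TASKS_PER_FRAMEWORK` cap easy to spot.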
      

      Active tasks are just 6% of the state.json response (252774 of 4138942 bytes):

      mesos λ cat ~/temp/mesos-state.json | jq -c . | wc
             1   14796 4138942
      mesos λ cat ~/temp/mesos-state.json | jq .frameworks[].tasks | jq -c . | wc
            16      37  252774
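
The same ratio can be estimated in a script. The sketch below is an approximation, not an exact match for `jq -c | wc`: `json.dumps` escapes non-ASCII characters by default and adds no trailing newlines, so byte counts may differ slightly. The helper names are hypothetical.

```python
import json


def compact_size(value):
    """Size in bytes of value serialized as compact JSON (roughly `jq -c`)."""
    return len(json.dumps(value, separators=(",", ":")).encode("utf-8"))


def active_task_share(state):
    """Fraction of the compact state taken up by active tasks."""
    total = compact_size(state)
    active = sum(
        compact_size(framework.get("tasks", []))
        for framework in state.get("frameworks", [])
    )
    return active / total


# The byte counts quoted above: active tasks vs. the whole response.
print(f"{252774 / 4138942:.1%}")  # prints 6.1%
```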
      

      I see four options that can improve the situation:

      1. Add a query string parameter to exclude completed tasks from state.json, and use it in mesos-dns and similar tools. mesos-dns has no need to know about completed tasks; they are just extra load on the master and on mesos-dns.

      2. Make the history size configurable.

      3. Make JSON serialization faster. With tens of thousands of tasks, serializing them for mesos-dns would take a long time even without history. Polling every 60 seconds instead of every 5 seconds isn't really an option.

      4. Create an event bus for the Mesos master. Marathon has one and it would be nice to have it in Mesos. This way mesos-dns could stop polling master state and listen for events instead.

      All four can be done independently.
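
To make options 1 and 2 concrete, here is a minimal sketch of the idea, not Mesos code. `FrameworkHistory`, `max_completed_tasks` and `include_completed` are hypothetical names: the first parameter stands in for a configurable replacement of the hard-coded 1000, the second for a query string switch that lets clients such as mesos-dns skip the history.

```python
from collections import deque


class FrameworkHistory:
    """Tracks active tasks plus a bounded history of completed ones."""

    def __init__(self, max_completed_tasks=1000):
        self.active = {}
        # deque(maxlen=N) drops the oldest entry once N entries are stored.
        self.completed = deque(maxlen=max_completed_tasks)

    def launch(self, task_id, info):
        self.active[task_id] = info

    def complete(self, task_id):
        # Move the task from the active set into the bounded history.
        self.completed.append(self.active.pop(task_id))

    def render(self, include_completed=True):
        """Build the per-framework state, optionally without history."""
        state = {"tasks": list(self.active.values())}
        if include_completed:
            state["completed_tasks"] = list(self.completed)
        return state
```

With a small `max_completed_tasks` the history stays short for every consumer, while `render(include_completed=False)` avoids serializing it at all for consumers that only care about running tasks.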

      Note to Mesosphere folks: please start distributing debug symbols with your distribution. I have been asking for them for a while, and they would be really helpful: https://github.com/mesosphere/marathon/issues/1497#issuecomment-104182501

      Perf report for leading master:

      I'm on 0.23.0.

            People

            • Assignee:
              Kevin Klues (klueska)
            • Reporter:
              Ivan Babrou (bobrik)
            • Shepherd:
              Benjamin Mahler