Mesos / MESOS-9786

Race between two REMOVE_QUOTA calls crashes the master.

    Details

    • Sprint:
      Resource Mgmt: RI14 Sp 46
    • Story Points:
      2

      Description

      The existence of the quota in the master is validated here:
      https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L700

      Then the quota is removed from the master in a deferred method call:
      https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L744

      And then it is removed from the allocator in another deferred call:
      https://github.com/apache/mesos/blob/a9a2acabd03181865055b77cf81e7bb310b236d6/src/master/quota_handler.cpp#L753

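      So there is a race between two simultaneous REMOVE_QUOTA calls: both can pass the existence check before either deferred removal runs, and the second removal then acts on a quota that is no longer there.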

      We observe this race on a heavily loaded cluster. We currently suspect that the client retries the call because it has gone unprocessed for a long time, and that the retry triggers the race.
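
      The interleaving described above can be sketched with a toy model. This is not Mesos code: `ToyMaster`, `removeQuotaRacy`, `removeQuotaGuarded` and `drain` are hypothetical names, the deferred calls are modelled as a plain FIFO queue, and the counter only marks the point where a real master could hit a failed check. The guarded variant shows one possible mitigation (re-validating inside the deferred call), which is an assumption and not necessarily the fix that was applied.

```cpp
#include <deque>
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for the master: deferred calls are queued and run
// later, in FIFO order, on a single thread.
struct ToyMaster {
  std::set<std::string> quotas;                // roles that currently have quota
  std::deque<std::function<void()>> deferred;  // pending deferred calls
  int removalsOfMissingQuota = 0;              // where a real master could crash

  // Pattern from the report: validate now, remove later.
  bool removeQuotaRacy(const std::string& role) {
    if (quotas.count(role) == 0) return false;  // existence check
    deferred.push_back([this, role] {
      // Another deferred removal may already have run by now.
      if (quotas.erase(role) == 0) ++removalsOfMissingQuota;
    });
    return true;
  }

  // Possible mitigation (assumption): re-check inside the deferred call and
  // treat an already-removed quota as a no-op.
  bool removeQuotaGuarded(const std::string& role) {
    if (quotas.count(role) == 0) return false;
    deferred.push_back([this, role] {
      if (quotas.count(role) == 0) return;  // already removed
      quotas.erase(role);
    });
    return true;
  }

  // Run every queued deferred call in order.
  void drain() {
    while (!deferred.empty()) {
      std::function<void()> f = std::move(deferred.front());
      deferred.pop_front();
      f();
    }
  }
};
```

      Calling `removeQuotaRacy("dev")` twice before `drain()` lets both calls pass validation, and the second deferred removal then finds no quota to remove. With `removeQuotaGuarded` the second deferred call becomes a no-op.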

            People

            • Assignee:
              mzhu Meng Zhu
            • Reporter:
              asekretenko Andrei Sekretenko
            • Votes:
              0
            • Watchers:
              3
