  1. Mesos
  2. MESOS-10011

Operation feedback with stale agent ID crashes the master

Details

    • Type: Bug
    • Status: Accepted
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.9.0
    • Fix Version/s: None
    • Component/s: agent, master

    Description

      We have observed the following in our environment.

      F1003 17:35:30.742681 58334 master.cpp:12528] Check failed: slave != nullptr f664c4a9-d1ca-4cd0-88e4-0a6acf20e629-S218
      *** Check failure stack trace: ***
          @     0x7fd36ca9cf4d  google::LogMessage::Fail()
          @     0x7fd36ca9f13d  google::LogMessage::SendToLog()
          @     0x7fd36ca9ca87  google::LogMessage::Flush()
          @     0x7fd36ca9fbc9  google::LogMessageFatal::~LogMessageFatal()
          @     0x7fd36b5ae3bc  mesos::internal::master::Master::removeOperation()
          @     0x7fd36b5b3446  mesos::internal::master::Master::updateOperationStatus()
      

      This follows registration of an agent that has changed its agent ID due to losing its local state.

      The failing check is in Master::removeOperation.
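
      To illustrate the failure mode, here is a minimal, self-contained sketch of a lookup that tolerates feedback carrying an agent ID the master no longer knows. The Agent and Master types, the Result enum, and the IDs used are hypothetical simplifications for illustration only; this is neither the actual Mesos code nor a proposed fix.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT the real Mesos types.
struct Agent {
  std::set<std::string> operations;  // UUIDs of operations tracked for this agent
};

struct Master {
  enum class Result { REMOVED, UNKNOWN_AGENT, UNKNOWN_OPERATION };

  std::map<std::string, Agent> agents;  // registered agents keyed by agent ID

  // Tolerant variant: an unknown agent ID is reported to the caller
  // instead of aborting the process.
  Result removeOperation(const std::string& agentId, const std::string& uuid) {
    auto it = agents.find(agentId);
    if (it == agents.end()) {
      // This corresponds to the situation in which the reported
      // `slave != nullptr` check failed.
      return Result::UNKNOWN_AGENT;
    }
    if (it->second.operations.erase(uuid) == 0) {
      return Result::UNKNOWN_OPERATION;
    }
    return Result::REMOVED;
  }
};

// Replays the reported scenario: feedback names the old agent ID after the
// agent has re-registered under a new one (both IDs are made up).
void replayStaleFeedback() {
  Master master;
  master.agents["new-id"].operations.insert("op-1");

  assert(master.removeOperation("old-id", "op-1") ==
         Master::Result::UNKNOWN_AGENT);
  assert(master.removeOperation("new-id", "op-1") ==
         Master::Result::REMOVED);
}
```

      Returning a result rather than asserting lets the caller decide whether to log and drop the stale feedback; whether that is the right behavior for the master is a design question this sketch does not settle.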

      The masters would enter a crash loop unless the operation checkpoint state (i.e., resources_and_operations.state) on the offending agent is deleted.

      Even though we try to minimize the cases where an agent loses its state, it can still happen when the latest symlink is removed, either by an operator or automatically in certain cases.

          People

            Assignee: Unassigned
            Reporter: Yan Xu (xujyan)
