  1. Mesos
  2. MESOS-3834

slave upgrade framework checkpoint incompatibility

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.24.1
    • Fix Version/s: 0.24.2, 0.25.1, 0.26.1, 0.27.0
    • Component/s: None
    • Labels:
      None

      Description

      We are upgrading from 0.22 to 0.25 and experienced the following crash in the 0.24 slave:

      F1104 05:20:49.162701  1153 slave.cpp:4175] Check failed: frameworkInfo.has_id()
      *** Check failure stack trace: ***
          @     0x7fef9c294650  google::LogMessage::Fail()
          @     0x7fef9c29459f  google::LogMessage::SendToLog()
          @     0x7fef9c293fb0  google::LogMessage::Flush()
          @     0x7fef9c296ce4  google::LogMessageFatal::~LogMessageFatal()
          @     0x7fef9b9a5492  mesos::internal::slave::Slave::recoverFramework()
          @     0x7fef9b9a3314  mesos::internal::slave::Slave::recover()
          @     0x7fef9b9d069c  _ZZN7process8dispatchI7NothingN5mesos8internal5slave5SlaveERK6ResultINS4_5state5StateEES9_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSG_FSE_T1_ET2_ENKUlPNS_11ProcessBaseEE_clESP_
          @     0x7fef9ba039f4  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave5SlaveERK6ResultINS8_5state5StateEESD_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSK_FSI_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
      

      As near as I can tell, what happened was this:

      • 0.22 wrote framework.info without the FrameworkID
      • 0.23 had a compatibility check so it was ok with it
      • 0.24 removed the compatibility check in MESOS-2259
      • the framework checkpoint doesn't get rewritten during recovery, so when the 0.24 slave starts it still reads the 0.22 version
      • 0.24 fails the frameworkInfo.has_id() check and aborts
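
The sequence above can be modelled in a few lines of C++. This is only a sketch under stated assumptions, not the Mesos source: `FrameworkInfo` here is a hypothetical stand-in for the checkpointed protobuf, the function names are invented for illustration, and filling the missing ID from an ID the agent already knows (`knownId`) is an assumption about how the 0.23 compatibility check behaved. The real agent aborts through a glog CHECK; the sketch throws instead so the failure can be observed.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the contents of a checkpointed framework.info.
struct FrameworkInfo {
  std::string name;
  std::string id;  // empty means no FrameworkID was checkpointed
  bool has_id() const { return !id.empty(); }
};

// 0.22-style checkpoint: per the report, written without the FrameworkID.
FrameworkInfo writtenBy022(const std::string& name) {
  return FrameworkInfo{name, ""};
}

// 0.23-style recovery: tolerate a missing ID.
// Assumption: the ID is filled in from one the agent already knows.
FrameworkInfo recoverLenient(FrameworkInfo info, const std::string& knownId) {
  assert(!knownId.empty());
  if (!info.has_id()) {
    info.id = knownId;
  }
  return info;
}

// 0.24-style recovery: the compatibility check is gone, so a missing ID is
// fatal. The real agent aborts via CHECK; this sketch throws instead.
FrameworkInfo recoverStrict(const FrameworkInfo& info) {
  if (!info.has_id()) {
    throw std::logic_error("Check failed: frameworkInfo.has_id()");
  }
  return info;
}
```

Passing the result of `writtenBy022()` to `recoverLenient()` succeeds, while passing it to `recoverStrict()` throws, which mirrors the crash in the log.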

          Activity

          jamespeach James Peach added a comment -

          I'm gonna take a crack at a patch for us that restores the compatibility check and also rewrites the framework checkpoint once it is recovered. If the latter is a terrible idea for some reason, I'd love to be educated about it.

          jamespeach James Peach added a comment -

          Removing the check for RECOVERING state in Framework::Framework() appears to be safe and works in limited testing. Vinod Kone, since you added that check in e8c73402, do you know of any reason this would be a bad idea?

          jamespeach James Peach added a comment -

          https://reviews.apache.org/r/40177/ Vinod Kone or Kapil Arya, could you shepherd this bug?

          vinodkone Vinod Kone added a comment -

          commit 0bb09121f7f05d9a215a84d87ca381f59a5fd957
          Author: James Peach <jpeach@apache.org>
          Date: Mon Nov 23 15:31:05 2015 -0800

          Re-checkpoint frameworks after agent recovery.

          When performing an upgrade cycle, it is possible for a 0.24 and
          later agent to recover from a framework checkpoint written by 0.22
          or earlier. In this case, we need to compatibly accept a missing
          FrameworkID, and then rewrite the framework checkpoint so that
          subsequent upgrades don't hit the same problem.

          Review: https://reviews.apache.org/r/40177
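
The behaviour described in the commit message can be sketched as follows. This is an illustration under stated assumptions rather than the patch itself: `CheckpointStore` is a hypothetical in-memory map standing in for the on-disk framework.info files, `recoverAndRecheckpoint` is an invented name, and keying the store by framework ID (so the ID is available even when the checkpoint lacks it) is an assumption.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the contents of a checkpointed framework.info.
struct FrameworkInfo {
  std::string name;
  std::string id;  // empty means no FrameworkID was checkpointed
  bool has_id() const { return !id.empty(); }
};

// Hypothetical in-memory store, keyed by framework ID, standing in for the
// framework.info files on disk.
typedef std::map<std::string, FrameworkInfo> CheckpointStore;

// Recover one framework: accept a checkpoint that lacks the FrameworkID,
// then write the checkpoint back so later recoveries see a complete record.
// Throws std::out_of_range if no checkpoint exists for frameworkId.
FrameworkInfo recoverAndRecheckpoint(CheckpointStore& store,
                                     const std::string& frameworkId) {
  FrameworkInfo info = store.at(frameworkId);

  // Compatibility path for checkpoints written by 0.22 or earlier.
  if (!info.has_id()) {
    info.id = frameworkId;
  }
  assert(info.has_id());

  // Re-checkpoint after recovery (done unconditionally in this sketch).
  store[frameworkId] = info;
  return info;
}
```

After one call, the stored record carries the ID, so a later agent that insists on `has_id()` can recover from it without a compatibility path.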


            People

            • Assignee:
              jamespeach James Peach
              Reporter:
              jamespeach James Peach