  1. Mesos
  2. MESOS-8839

Resource provider manager registrar recovery can race with agent on agent state leading to hard failures

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.6.0, 1.8.0
    • Fix Version/s: None
    • Component/s: agent, storage, test
    • Labels:
    • Target Version/s:
    • Sprint:
      Mesosphere Sprint 78
    • Story Points:
      3

      Description

      When running in the agent, the resource provider manager persists its state into the agent's state. The agent uses a LevelDB state, which protects against concurrent access. The way we modelled LevelDB, a fetch while a lock is held leads to a failed Future result. When the resource provider manager encounters a failed recovery, it emits a fatal error, e.g.,

      11:48:26 F0425 11:48:26.650568 26819 manager.cpp:254] Failed to recover resource provider manager registry: Failed: IO error: lock /tmp/ParentChildContainerTypeAndContentType_AgentContainerAPITest_RecoverNestedContainer_10_HXbQCK/meta/slaves/6645885c-050a-4518-b896-a20b3e72a070-S0/resource_provider_registry/LOCK: already held by process
      11:48:26 *** Check failure stack trace: ***

      We should not fail hard for such recoverable failure scenarios.
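
      One way to avoid the hard failure is to treat lock contention as transient and retry the recovery a bounded number of times before giving up. The following is only a minimal, standalone sketch of that idea and not the Mesos implementation: `RecoverResult`, `isLockContention`, and `recoverWithRetry` are hypothetical names, the recovery step is modelled as a plain callable instead of a libprocess Future, and matching on the error text is an assumption based on the log line above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <thread>

// Hypothetical result type: an empty optional means recovery succeeded,
// otherwise it holds the error message.
using RecoverResult = std::optional<std::string>;

// Assumption: a held LevelDB lock shows up in the error text as in the
// log above. A real implementation would prefer a typed error.
bool isLockContention(const std::string& error)
{
  return error.find("LOCK: already held by process") != std::string::npos;
}

// Calls `recover` up to `maxAttempts` times. Only lock-contention errors
// are retried; the delay doubles after each failed attempt. The result of
// the last attempt is returned so the caller can decide how to proceed.
RecoverResult recoverWithRetry(
    const std::function<RecoverResult()>& recover,
    int maxAttempts,
    std::chrono::milliseconds initialDelay)
{
  assert(maxAttempts > 0);

  std::chrono::milliseconds delay = initialDelay;
  RecoverResult result;

  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    result = recover();

    // Success or a non-transient error: stop retrying.
    if (!result.has_value() || !isLockContention(*result)) {
      return result;
    }

    if (attempt < maxAttempts) {
      std::this_thread::sleep_for(delay);
      delay *= 2;
    }
  }

  return result;
}
```

      With such a wrapper, the caller would log and abort only if an error is still present after the final attempt, or if the error is not lock contention.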

        Attachments

        1. log
          26 kB
          Benjamin Bannier

              People

              • Assignee:
                bbannier Benjamin Bannier
                Reporter:
                bbannier Benjamin Bannier
                Shepherd:
                Chun-Hung Hsiao