Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-8422

Master's UpdateSlave handler not correctly updating terminated operations

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.0
    • Component/s: master
    • Labels:

      Description

      I created a test that verifies that operation status updates are resent to the master after being dropped en route to it (MESOS-8420).

      The test does the following:

      1. Creates a volume from a RAW disk resource.
      2. Drops the first `UpdateOperationStatusMessage` message from the agent to the master, so that it isn't acknowledged by the master.
      3. Restarts the agent.
      4. Verifies that the agent resends the operation status update.

      The good news are that the agent is resending the operation status update, the bad news are that it triggers a CHECK failure that crashes the master.

      Here are the relevant sections of the log produced by the test:

      [ RUN      ] StorageLocalResourceProviderTest.ROOT_RetryOperationStatusUpdateAfterRecovery
      [...]
      I0109 16:36:08.515882 24106 master.cpp:4284] Processing ACCEPT call for offers: [ 046b3f21-6e97-4a56-9a13-773f7d481efd-O0 ] on agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (default) at scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681
      I0109 16:36:08.516487 24106 master.cpp:5260] Processing CREATE_VOLUME operation with source disk(allocated: storage)(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096 from framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (default) at scheduler-2a48a684-64b4-4b4d-a396-6491adb4f2b1@10.0.49.2:40681 to agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
      I0109 16:36:08.518704 24106 master.cpp:10622] Sending operation '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) to agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev)
      I0109 16:36:08.521210 24130 provider.cpp:504] Received APPLY_OPERATION event
      I0109 16:36:08.521276 24130 provider.cpp:1368] Received CREATE_VOLUME operation '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
      I0109 16:36:08.523131 24432 test_csi_plugin.cpp:305] CreateVolumeRequest '{"version":{"minor":1},"name":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","capacityRange":{"requiredBytes":"4294967296","limitBytes":"4294967296"},"volumeCapabilities":[{"mount":{},"accessMode":{"mode":"SINGLE_NODE_WRITER"}}]}'
      I0109 16:36:08.525806 24152 provider.cpp:2635] Applying conversion from 'disk(allocated: storage)(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096' to 'disk(allocated: storage)(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096' for operation (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408)
      I0109 16:36:08.528725 24134 status_update_manager_process.hpp:152] Received operation status update OPERATION_FINISHED (Status UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework '046b3f21-6e97-4a56-9a13-773f7d481efd-0000' on agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
      I0109 16:36:08.529207 24134 status_update_manager_process.hpp:929] Checkpointing UPDATE for operation status update OPERATION_FINISHED (Status UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework '046b3f21-6e97-4a56-9a13-773f7d481efd-0000' on agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
      I0109 16:36:08.573177 24150 http.cpp:1185] HTTP POST for /slave(2)/api/v1/resource_provider from 10.0.49.2:53598
      I0109 16:36:08.573974 24139 slave.cpp:7065] Handling resource provider message 'UPDATE_OPERATION_STATUS: (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)'
      I0109 16:36:08.574154 24139 slave.cpp:7409] Updating the state of operation ' with no ID (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)
      I0109 16:36:08.574785 24139 slave.cpp:7249] Forwarding status update of operation with no ID (operation_uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000
      I0109 16:36:08.583748 24084 slave.cpp:931] Agent terminating
      I0109 16:36:08.584115 24144 master.cpp:1305] Agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(2)@10.0.49.2:40681 (core-dev) disconnected
      [...]
      I0109 16:36:08.655766 24140 slave.cpp:1378] Re-registered with master master@10.0.49.2:40681
      I0109 16:36:08.655936 24117 task_status_update_manager.cpp:188] Resuming sending task status updates
      I0109 16:36:08.655995 24149 hierarchical.cpp:669] Agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
      I0109 16:36:08.656008 24140 slave.cpp:1423] Forwarding agent update {"operations":{},"resource_version_uuid":{"value":"icuAKyO6TymMt2Y9vyF6Jg=="},"slave_id":{"value":"046b3f21-6e97-4a56-9a13-773f7d481efd-S0"},"update_oversubscribed_resources":true}
      I0109 16:36:08.656121 24149 hierarchical.cpp:754] Agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 reactivated
      W0109 16:36:08.656481 24113 master.cpp:7277] !!!! update slave message: slave_id {
        value: "046b3f21-6e97-4a56-9a13-773f7d481efd-S0"
      }
      update_oversubscribed_resources: true
      operations {
      }
      resource_version_uuid {
        value: "\211\313\200+#\272O)\214\267f=\277!z&"
      }
      I0109 16:36:08.656637 24113 master.cpp:7320] Received update of agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 at slave(3)@10.0.49.2:40681 (core-dev) with total oversubscribed resources {}
      W0109 16:36:08.657387 24113 master.cpp:7704] Performing explicit reconciliation with agent for known operation 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 since it was not present in original reconciliation message from agent
      I0109 16:36:08.657917 24133 hierarchical.cpp:669] Agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]
      W0109 16:36:08.658048 24125 manager.cpp:472] Dropping operation reconciliation message with operation_uuid 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 because resource provider 605b22f5-e39d-4d9f-950a-e7f44d202c01 is not subscribed
      I0109 16:36:08.658609 24143 container_daemon.cpp:119] Launching container 'org-apache-mesos-rp-local-storage-test--org-apache-mesos-csi-test-slrp_test--CONTROLLER_SERVICE-NODE_SERVICE'
      [...]
      I0109 16:36:08.689859 24130 provider.cpp:3066] Sending UPDATE_STATE call with resources 'disk(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096' and 1 operations to agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
      I0109 16:36:08.690449 24130 provider.cpp:1042] Resource provider 605b22f5-e39d-4d9f-950a-e7f44d202c01 is in READY state
      I0109 16:36:08.690491 24105 status_update_manager_process.hpp:385] Resuming operation status update manager
      I0109 16:36:08.690640 24105 status_update_manager_process.hpp:394] Sending operation status update OPERATION_FINISHED (Status UUID: 0c79cdf2-b89d-453b-bb62-57766e968dd0) for operation UUID 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408 of framework '046b3f21-6e97-4a56-9a13-773f7d481efd-0000' on agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0
      I0109 16:36:08.693244 24131 http.cpp:1185] HTTP POST for /slave(3)/api/v1/resource_provider from 10.0.49.2:53606
      I0109 16:36:08.693912 24140 http.cpp:1185] HTTP POST for /slave(3)/api/v1/resource_provider from 10.0.49.2:53606
      I0109 16:36:08.693974 24115 manager.cpp:677] Received UPDATE_STATE call with resources '[{"disk":{"source":{"id":"18b4c4a5-d162-4dcf-bb21-a13c6ee0f408","metadata":{"labels":[{"key":"path","value":"\/tmp\/n5thZ3\/test\/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"}]},"mount":{"root":".\/csi\/org.apache.mesos.csi.test\/slrp_test\/mounts\/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"},"profile":"volume-default","type":"MOUNT"}},"name":"disk","provider_id":{"value":"605b22f5-e39d-4d9f-950a-e7f44d202c01"},"reservations":[{"role":"storage","type":"DYNAMIC"}],"scalar":{"value":4096.0},"type":"SCALAR"}]' and 1 operations from resource provider 605b22f5-e39d-4d9f-950a-e7f44d202c01
      I0109 16:36:08.694897 24144 slave.cpp:7065] Handling resource provider message 'UPDATE_STATE: 605b22f5-e39d-4d9f-950a-e7f44d202c01 disk(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096'
      I0109 16:36:08.695184 24144 slave.cpp:7182] Forwarding new total resources cpus:2; mem:1024; disk:1024; ports:[31000-32000]; disk(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096
      I0109 16:36:08.696467 24144 slave.cpp:7065] Handling resource provider message 'UPDATE_OPERATION_STATUS: (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)'
      I0109 16:36:08.696594 24144 slave.cpp:7409] Updating the state of operation ' with no ID (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)
      I0109 16:36:08.696666 24144 slave.cpp:7249] Forwarding status update of operation with no ID (operation_uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) for framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000
      W0109 16:36:08.697093 24142 master.cpp:7277] !!!! update slave message: slave_id {
        value: "046b3f21-6e97-4a56-9a13-773f7d481efd-S0"
      }
      update_oversubscribed_resources: false
      operations {
      }
      resource_version_uuid {
        value: "\211\313\200+#\272O)\214\267f=\277!z&"
      }
      resource_providers {
        providers {
          info {
            id {
              value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
            }
            type: "org.apache.mesos.rp.local.storage"
            name: "test"
            default_reservations {
              role: "storage"
              type: DYNAMIC
            }
            storage {
              plugin {
                type: "org.apache.mesos.csi.test"
                name: "slrp_test"
                containers {
                  [...]
                }
              }
            }
          }
          total_resources {
            name: "disk"
            type: SCALAR
            scalar {
              value: 4096
            }
            disk {
              source {
                type: MOUNT
                mount {
                  root: "./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
                }
                id: "18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
                metadata {
                  labels {
                    key: "path"
                    value: "/tmp/n5thZ3/test/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
                  }
                }
                profile: "volume-default"
              }
            }
            provider_id {
              value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
            }
            reservations {
              role: "storage"
              type: DYNAMIC
            }
          }
          operations {
            operations {
              framework_id {
                value: "046b3f21-6e97-4a56-9a13-773f7d481efd-0000"
              }
              slave_id {
                value: "046b3f21-6e97-4a56-9a13-773f7d481efd-S0"
              }
              info {
                type: CREATE_VOLUME
                create_volume {
                  source {
                    name: "disk"
                    type: SCALAR
                    scalar {
                      value: 4096
                    }
                    disk {
                      source {
                        type: RAW
                        profile: "volume-default"
                      }
                    }
                    allocation_info {
                      role: "storage"
                    }
                    provider_id {
                      value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
                    }
                    reservations {
                      role: "storage"
                      type: DYNAMIC
                    }
                  }
                  target_type: MOUNT
                }
              }
              latest_status {
                state: OPERATION_FINISHED
                converted_resources {
                  name: "disk"
                  type: SCALAR
                  scalar {
                    value: 4096
                  }
                  disk {
                    source {
                      type: MOUNT
                      mount {
                        root: "./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
                      }
                      id: "18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
                      metadata {
                        labels {
                          key: "path"
                          value: "/tmp/n5thZ3/test/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
                        }
                      }
                      profile: "volume-default"
                    }
                  }
                  allocation_info {
                    role: "storage"
                  }
                  provider_id {
                    value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
                  }
                  reservations {
                    role: "storage"
                    type: DYNAMIC
                  }
                }
                uuid {
                  value: "\014y\315\362\270\235E;\273bWvn\226\215\320"
                }
              }
              statuses {
                state: OPERATION_FINISHED
                converted_resources {
                  name: "disk"
                  type: SCALAR
                  scalar {
                    value: 4096
                  }
                  disk {
                    source {
                      type: MOUNT
                      mount {
                        root: "./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
                      }
                      id: "18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
                      metadata {
                        labels {
                          key: "path"
                          value: "/tmp/n5thZ3/test/4GB-18b4c4a5-d162-4dcf-bb21-a13c6ee0f408"
                        }
                      }
                      profile: "volume-default"
                    }
                  }
                  allocation_info {
                    role: "storage"
                  }
                  provider_id {
                    value: "605b22f5-e39d-4d9f-950a-e7f44d202c01"
                  }
                  reservations {
                    role: "storage"
                    type: DYNAMIC
                  }
                }
                uuid {
                  value: "\014y\315\362\270\235E;\273bWvn\226\215\320"
                }
              }
              uuid {
                value: "\030\264\304\245\321bM\317\273!\241<n\340\364\010"
              }
            }
          }
          resource_version_uuid {
            value: "M\250\313j\320\301IG\262\0164e\004\367\304\333"
          }
        }
      }
      I0109 16:36:08.700137 24142 master.cpp:10411] Updating the state of operation '' (uuid: 18b4c4a5-d162-4dcf-bb21-a13c6ee0f408) of framework 046b3f21-6e97-4a56-9a13-773f7d481efd-0000 (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)
      I0109 16:36:08.700417 24146 hierarchical.cpp:669] Agent 046b3f21-6e97-4a56-9a13-773f7d481efd-S0 (core-dev) updated with total resources disk(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096; cpus:2; mem:1024; disk:1024; ports:[31000-32000]
      F0109 16:36:08.700610 24142 master.cpp:11687] CHECK_SOME(resources): disk(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096; cpus:2; mem:1024; disk:1024; ports:[31000-32000] does not contain disk(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096
      *** Check failure stack trace: ***
      F0109 16:36:08.700896 24146 hierarchical.cpp:908] CHECK_SOME(updatedTotal): disk(reservations: [(DYNAMIC,storage)])[MOUNT(18b4c4a5-d162-4dcf-bb21-a13c6ee0f408,volume-default):./csi/org.apache.mesos.csi.test/slrp_test/mounts/18b4c4a5-d162-4dcf-bb21-a13c6ee0f408]:4096; cpus:2; mem:1024; disk:1024; ports:[31000-32000] does not contain disk(reservations: [(DYNAMIC,storage)])[RAW(,volume-default)]:4096
      *** Check failure stack trace: ***
          @     0x7ff06d3bbe7e  (unknown)
          @     0x7ff06d3bbe7e  (unknown)
          @     0x7ff06d3bbddd  (unknown)
          @     0x7ff06d3bbddd  (unknown)
          @     0x7ff06d3bb7ee  (unknown)
          @     0x7ff06d3bb7ee  (unknown)
          @     0x7ff06d3be522  (unknown)
          @     0x55c1c6c2be77  _ZTSN6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureISt4listIN5mesos5slave13QoSCorrectionESaISF_EEEEEclINS0_IFSI_vEEEEESI_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISH_EESt14default_deleteISU_EEOSM_S3_E_ISX_SM_St12_PlaceholderILi1EEEEEEE
          @     0x7ff06d3be522  (unknown)
          @     0x55c1c6c2be77  _ZTSN6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureISt4listIN5mesos5slave13QoSCorrectionESaISF_EEEEEclINS0_IFSI_vEEEEESI_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseISH_EESt14default_deleteISU_EEOSM_S3_E_ISX_SM_St12_PlaceholderILi1EEEEEEE
          @     0x7ff06b729277  (unknown)
          @     0x55c1c6f3be8a  _ZTSN6lambda12CallableOnceIFvRK6ResultIN5mesos2v117resource_provider5EventEEEE10CallableFnINS_8internal7PartialIZNK7process6FutureIS6_E7onReadyISt5_BindIFSt7_Mem_fnIMSG_FbS8_EESG_St12_PlaceholderILi1EEEEbEERKSG_OT_NSG_6PreferEEUlOSQ_S8_E_ISQ_SO_EEEEE
      

      We can see that once the SLRP reregisters with the agent, the following happens:

      1. The agent will send an UpdateSlave message to the master including the converted resources and the CREATE_VOLUME operation with the status OPERATION_FINISHED.
      2. The master will update the agent's resources, including the volume created by the operation.
      3. The agent will resend the operation status update.
      4. The master will try to apply the operation and crash, because it already updated the agent's resources on step #2.

        Attachments

          Activity

            People

            • Assignee:
              bbannier Benjamin Bannier
              Reporter:
              gkleiman Gastón Kleiman
              Shepherd:
              Greg Mann
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: