  Mesos / MESOS-7622

Agent can crash if an HTTP executor tries to retry subscription in running state.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Won't Fix
    • Affects Version/s: 1.2.2
    • Fix Version/s: None
    • Component/s: agent, executor

    Description

      It is possible for a running executor to retry its Subscribe request. If the executor previously had any launched tasks, this crashes the agent with a CHECK failure in Slave::subscribe() (see the log below and the sketch after it). Note that the executor would still be able to subscribe again while the agent process restarts and is recovering.

      sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave --isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime --image_providers=docker --image_provisioner_backend=overlay --containerizers=mesos --launcher_dir=$(pwd) --executor_environment_variables='{"LD_LIBRARY_PATH": "/home/aaron/Code/src/mesos/build/src/.libs"}'
      WARNING: Logging before InitGoogleLogging() is written to STDERR
      I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by aaron
      I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0
      I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected
      I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state
      I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice `mesos_executors.slice`
      I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver
      I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret
      I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
      E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' failed; this is the output:
      sh: 1: hadoop: not found
      I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin 'hadoop' as it could not be created: Failed to create HDFS client: Failed to execute 'hadoop version 2>&1'; the command was either not found or exited with a non-zero exit status: 127
      I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend 'overlay'
      I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on (1)@127.0.1.1:5051
      I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}" --executor_registration_timeout="1mins" --executor_reregistration_timeout="2secs" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="true" --http_command_executor="false" --http_heartbeat_interval="30secs" --image_providers="docker" --image_provisioner_backend="overlay" --initialize_driver_logging="true" --isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime" --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" --logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" --max_completed_executors_per_framework="150" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/tmp/slave"
      I0605 14:58:23.786392 10710 slave.cpp:552] Agent resources: cpus(*):6; mem(*):6956; disk(*):41113; ports(*):[31000-32000]
      I0605 14:58:23.786437 10710 slave.cpp:560] Agent attributes: [  ]
      I0605 14:58:23.786468 10710 slave.cpp:565] Agent hostname: U64
      I0605 14:58:23.786574 10714 status_update_manager.cpp:177] Pausing sending status updates
      I0605 14:58:23.787470 10718 state.cpp:62] Recovering state from '/tmp/slave/meta'
      I0605 14:58:23.787698 10713 status_update_manager.cpp:203] Recovering status update manager
      I0605 14:58:23.787755 10713 containerizer.cpp:582] Recovering containerizer
      I0605 14:58:23.788499 10714 linux_launcher.cpp:291] Recovered container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
      I0605 14:58:23.788805 10714 linux_launcher.cpp:373] 0fbfe58f-a1bf-4128-83ed-81c2901299b7 is a known orphaned container
      I0605 14:58:23.790591 10714 memory.cpp:479] Started listening for OOM events for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
      I0605 14:58:23.791183 10714 memory.cpp:590] Started listening on 'low' memory pressure events for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
      I0605 14:58:23.791714 10714 memory.cpp:590] Started listening on 'medium' memory pressure events for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
      I0605 14:58:23.792260 10714 memory.cpp:590] Started listening on 'critical' memory pressure events for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
      I0605 14:58:23.792899 10717 provisioner.cpp:372] Recovered container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
      I0605 14:58:23.793344 10717 metadata_manager.cpp:236] Successfully loaded 1 Docker images
      I0605 14:58:23.793432 10717 provisioner.cpp:416] Provisioner recovery complete
      I0605 14:58:23.793514 10717 containerizer.cpp:939] Cleaning up orphan container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
      I0605 14:58:23.793536 10717 containerizer.cpp:2101] Destroying container 0fbfe58f-a1bf-4128-83ed-81c2901299b7 in RUNNING state
      I0605 14:58:23.793669 10717 slave.cpp:6127] Finished recovery
      I0605 14:58:23.793853 10711 linux_launcher.cpp:505] Asked to destroy container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
      I0605 14:58:23.794039 10717 slave.cpp:945] New master detected at master@10.0.2.15:5050
      I0605 14:58:23.794100 10717 slave.cpp:969] No credentials provided. Attempting to register without authentication
      I0605 14:58:23.794109 10717 slave.cpp:980] Detecting new master
      I0605 14:58:23.794131 10717 status_update_manager.cpp:177] Pausing sending status updates
      I0605 14:58:23.794304 10711 linux_launcher.cpp:548] Using freezer to destroy cgroup mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7
      I0605 14:58:23.794937 10717 cgroups.cpp:2692] Freezing cgroup /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7
      I0605 14:58:23.795821 10711 cgroups.cpp:1405] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7 after 850944ns
      I0605 14:58:23.813447 10714 cgroups.cpp:2710] Thawing cgroup /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7
      I0605 14:58:23.816241 10714 cgroups.cpp:1434] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/0fbfe58f-a1bf-4128-83ed-81c2901299b7 after 2.750976ms
      I0605 14:58:23.851388 10711 containerizer.cpp:2507] Container 0fbfe58f-a1bf-4128-83ed-81c2901299b7 has exited
      W0605 14:58:23.853590 10712 disk.cpp:448] Ignoring cleanup for unknown container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
      I0605 14:58:23.856833 10711 provisioner.cpp:576] Destroying container rootfs at '/tmp/slave/provisioner/containers/0fbfe58f-a1bf-4128-83ed-81c2901299b7/backends/overlay/rootfses/ade95f11-60ec-47ee-9e4e-bcbc63cc5eaa' for container 0fbfe58f-a1bf-4128-83ed-81c2901299b7
      I0605 14:58:24.572808 10712 slave.cpp:1148] Registered with master master@10.0.2.15:5050; given agent ID eb661932-585d-453a-bc64-a720e05ce634-S0
      I0605 14:58:24.572888 10711 status_update_manager.cpp:184] Resuming sending status updates
      I0605 14:58:24.573446 10712 slave.cpp:1206] Forwarding total oversubscribed resources {}
      I0605 14:58:24.579311 10711 slave.cpp:1632] Got assigned task 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
      I0605 14:58:24.580113 10711 slave.cpp:1913] Authorizing task 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
      I0605 14:58:24.580293 10711 slave.cpp:2100] Launching task 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
      I0605 14:58:24.580703 10711 paths.cpp:573] Trying to chown '/tmp/slave/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a' to user 'root'
      I0605 14:58:24.581671 10711 slave.cpp:7086] Launching executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 with resources cpus(*)(allocated: *):0.1; mem(*)(allocated: *):32 in work directory '/tmp/slave/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a'
      I0605 14:58:24.581881 10711 slave.cpp:2795] Launching container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a for executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
      I0605 14:58:24.582072 10712 containerizer.cpp:1056] Starting container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
      I0605 14:58:24.582391 10711 slave.cpp:2329] Queued task 'testapp-cc6e64001fee44e3a20d7a15149d8b34' for executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
      I0605 14:58:24.583070 10712 provisioner.cpp:459] Provisioning image rootfs '/tmp/slave/provisioner/containers/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a/backends/overlay/rootfses/736ac735-51b1-4bc9-892b-15330d0784f9' for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a using overlay backend
      I0605 14:58:24.587167 10715 memory.cpp:479] Started listening for OOM events for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
      I0605 14:58:24.587508 10715 memory.cpp:590] Started listening on 'low' memory pressure events for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
      I0605 14:58:24.587805 10715 memory.cpp:590] Started listening on 'medium' memory pressure events for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
      I0605 14:58:24.588145 10715 memory.cpp:590] Started listening on 'critical' memory pressure events for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
      I0605 14:58:24.588714 10715 memory.cpp:199] Updated 'memory.soft_limit_in_bytes' to 64MB for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
      I0605 14:58:24.589285 10715 memory.cpp:228] Updated 'memory.limit_in_bytes' to 64MB for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
      I0605 14:58:24.589588 10715 cpu.cpp:101] Updated 'cpu.shares' to 204 (cpus 0.2) for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
      I0605 14:58:24.591984 10711 linux_launcher.cpp:429] Launching container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a and cloning with namespaces CLONE_NEWNS
      I0605 14:58:24.593714 10711 systemd.cpp:96] Assigned child process '10736' to 'mesos_executors.slice'
      I0605 14:58:24.595461 10712 containerizer.cpp:1722] Checkpointing container's forked pid 10736 to '/tmp/slave/meta/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a/pids/forked.pid'
      I0605 14:58:25.247808 10718 slave.cpp:3825] Got registration for executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 from executor(1)@127.0.1.1:42459
      I0605 14:58:25.248884 10715 disk.cpp:208] Updating the disk resources for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a to cpus(*)(allocated: *):0.2; mem(*)(allocated: *):64; disk(*)(allocated: *):1024
      I0605 14:58:25.249397 10716 memory.cpp:199] Updated 'memory.soft_limit_in_bytes' to 64MB for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
      I0605 14:58:25.250066 10716 cpu.cpp:101] Updated 'cpu.shares' to 204 (cpus 0.2) for container cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a
      I0605 14:58:25.250365 10715 slave.cpp:2542] Sending queued task 'testapp-cc6e64001fee44e3a20d7a15149d8b34' to executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 at executor(1)@127.0.1.1:42459
      I0605 14:58:25.253556 10716 slave.cpp:4295] Handling status update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 from executor(1)@127.0.1.1:42459
      I0605 14:58:25.254091 10714 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
      I0605 14:58:25.254287 10714 status_update_manager.cpp:834] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
      I0605 14:58:25.254456 10714 slave.cpp:4735] Forwarding the update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 to master@10.0.2.15:5050
      I0605 14:58:25.254534 10714 slave.cpp:4645] Sending acknowledgement for status update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 to executor(1)@127.0.1.1:42459
      I0605 14:58:25.260128 10712 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
      I0605 14:58:25.260184 10712 status_update_manager.cpp:834] Checkpointing ACK for status update TASK_RUNNING (UUID: 0292182c-35e6-4df8-8583-fe11f726c211) for task testapp-cc6e64001fee44e3a20d7a15149d8b34 of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001
      I0605 14:58:25.352159 10712 http.cpp:1115] HTTP POST for /slave(1)/api/v1/executor from 10.0.2.15:55512 with User-Agent='mesos-framework-sdk'
      I0605 14:58:25.352342 10712 slave.cpp:3609] Received Subscribe request for HTTP executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 at executor(1)@127.0.1.1:42459
      I0605 14:58:25.352388 10712 slave.cpp:3672] Creating a marker file for HTTP based executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' of framework b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001 (via HTTP) at path '/tmp/slave/meta/slaves/eb661932-585d-453a-bc64-a720e05ce634-S0/frameworks/b9d7ab7a-f123-4a7c-bfda-07c483ece870-0001/executors/testapp-cc6e64001fee44e3a20d7a15149d8b34/runs/cb57b657-4fbb-4fec-9e00-8ca4f0bd0e4a/http.marker'
      F0605 14:58:25.352444 10712 slave.cpp:3685] Check failed: executor->launchedTasks.empty()  Newly registered executor 'testapp-cc6e64001fee44e3a20d7a15149d8b34' has launched tasks
      *** Check failure stack trace: ***
          @     0x7fb835fe30ed  google::LogMessage::Fail()
          @     0x7fb835fe4ef0  google::LogMessage::SendToLog()
          @     0x7fb835fe2cb3  google::LogMessage::Flush()
          @     0x7fb835fe5939  google::LogMessageFatal::~LogMessageFatal()
          @     0x7fb8346e126c  mesos::internal::slave::Slave::subscribe()
          @     0x7fb8345f3169  mesos::internal::slave::Http::executor()
          @     0x7fb8346b1107  _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEERKNS2_7RequestERK6OptionINS2_14authentication9PrincipalEEEZN5mesos8internal5slave5Slave10initializeEvEUlS7_SD_E1_E9_M_invokeERKSt9_Any_dataS7_SD_
          @     0x7fb833964a10  _ZNSt17_Function_handlerIFN7process6FutureINS0_4http8ResponseEEEvEZZNKS0_9_DeferredIZZNS0_11ProcessBase6_visitERKNS7_12HttpEndpointERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS0_5OwnedINS2_7RequestEEEENKUlRK6OptionINS2_14authentication20AuthenticationResultEEE0_clEST_EUlbE1_EcvSt8functionIFT_T0_EEIS4_RKbEEvENKUlS14_E_clES14_EUlvE_E9_M_invokeERKSt9_Any_data
          @     0x7fb83465fffe  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8internal8DispatchINS0_6FutureINS0_4http8ResponseEEEEclIRSt8functionIFS9_vEEEES9_RKNS0_4UPIDEOT_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
          @     0x7fb8339607ec  process::ProcessBase::visit()
          @     0x7fb83396c03b  process::ProcessManager::resume()
          @     0x7fb8339755bf  _ZNSt6thread11_State_implISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
          @     0x7fb832a3e83f  (unknown)
          @     0x7fb83596d6da  start_thread
          @     0x7fb8324ad17f  (unknown)
      Aborted (core dumped)
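
      The fatal check above fires in Slave::subscribe() (slave.cpp:3685): the agent assumes a subscribing executor has no launched tasks yet, which holds on first subscription but not on a retry. The self-contained C++ toy model below reproduces the sequence; the Executor struct, the subscribe() helper, and the task name are illustrative assumptions, and only the checked condition and its failure message mirror the agent's actual CHECK:

      // Toy model of the invariant that aborts the agent in this report.
      // Everything here is an illustrative reconstruction; only the
      // checked condition and its message mirror the real Slave::subscribe().
      #include <cstdlib>
      #include <iostream>
      #include <set>
      #include <string>

      struct Executor {
        std::string id;
        std::set<std::string> launchedTasks;  // tasks the agent has delivered
      };

      // Stand-in for the agent's Slave::subscribe(): on a (re)subscription
      // it asserts that the executor has no launched tasks, i.e.
      //   CHECK(executor->launchedTasks.empty())
      void subscribe(const Executor& executor) {
        if (!executor.launchedTasks.empty()) {
          std::cerr << "Check failed: executor->launchedTasks.empty()  "
                    << "Newly registered executor '" << executor.id
                    << "' has launched tasks" << std::endl;
          std::abort();  // the real agent aborts via glog's fatal CHECK
        }
        std::cout << "Subscribed executor '" << executor.id << "'" << std::endl;
      }

      int main() {
        Executor executor{"testapp", {}};

        subscribe(executor);  // initial Subscribe: launchedTasks is empty, OK

        // The agent sends the queued task to the now-running executor.
        executor.launchedTasks.insert("testapp-task");

        // A running executor retries its Subscribe request: the invariant
        // no longer holds and the (toy) agent aborts, as in the log above.
        subscribe(executor);
        return 0;
      }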
      


          People

            Assignee: Unassigned
            Reporter: Aaron Wood (aaron.wood)
            Votes: 0
            Watchers: 3
