MESOS-7366

Agent sandbox gc could accidentally delete the entire persistent volume content

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.0.2, 1.1.1, 1.2.0
    • Fix Version/s: 1.0.4, 1.1.2, 1.2.1
    • Component/s: None
    • Labels: None

    Description

      When 1) a persistent volume is mounted in the executor sandbox, 2) unmounting it fails or hangs, and 3) garbage collection (gc) of the executor directory is invoked, the agent emits a log line like:

      ```
      Failed to delete directory <executor_dir>/runs/<uuid>/volume: Device or resource busy
      ```

      After this, the persistent volume directory is empty.

      This can cause data loss for critical workloads, so we should fix it ASAP.
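The failure mode described above comes down to a recursive delete walking into a directory that is still a mount point. As a minimal sketch of one possible guard (not the fix that was applied to Mesos), the check below refuses to gc a directory while any mount point remains at or below it. It assumes Linux's `/proc/self/mountinfo` format; the function names and the shortened sample paths are hypothetical and chosen only for illustration.

```python
import os


def parse_mount_points(mountinfo_text):
    """Return the mount points listed in /proc/self/mountinfo-style text."""
    points = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # The fifth field (index 4) of a mountinfo line is the mount point.
        # Octal escapes such as \040 (space) are not decoded in this sketch.
        if len(fields) >= 5:
            points.append(fields[4])
    return points


def mounts_under(directory, mount_points):
    """Return the mount points located at or below `directory`."""
    root = os.path.normpath(directory)
    prefix = root.rstrip(os.sep) + os.sep
    found = []
    for point in mount_points:
        normalized = os.path.normpath(point)
        if normalized == root or normalized.startswith(prefix):
            found.append(normalized)
    return found


def safe_to_gc(directory, mountinfo_text):
    """True only if nothing is mounted at or below `directory`."""
    return not mounts_under(directory, parse_mount_points(mountinfo_text))


# Hypothetical, shortened mount table: one persistent volume is still
# mounted inside the sandbox of run R1.
SAMPLE_MOUNTINFO = "\n".join([
    "22 1 8:1 / / rw,relatime - ext4 /dev/sda1 rw",
    "95 22 8:17 /volumes/v1 /work/runs/R1/volume rw,relatime - ext4 /dev/sdb1 rw",
])

if __name__ == "__main__":
    print(safe_to_gc("/work/runs/R1", SAMPLE_MOUNTINFO))
    print(safe_to_gc("/work/runs/R2", SAMPLE_MOUNTINFO))
```

On a real agent the text would come from reading `/proc/self/mountinfo` immediately before deletion. Such a check is inherently racy, so it is a safety net rather than a substitute for unmounting the volume first.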

      The triggering environment is a custom executor without a rootfs image.

      Please let me know if you need more information. Relevant agent log:

      I0407 15:18:22.752624 22758 paths.cpp:536] Trying to chown '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' to user 'uber'
      I0407 15:18:22.763229 22758 slave.cpp:6179] Launching executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 with resources cpus(cassandra-cstar-location-store, cassandra, {resource_id: 29e2ac63-d605-4982-a463-fa311be94e0a}):0.1; mem(cassandra-cstar-location-store, cassandra, {resource_id: 2e1223f3-41a2-419f-85cc-cbc839c19c70}):768; ports(cassandra-cstar-location-store, cassandra, {resource_id: fdd6598f-f32b-4c90-a622-226684528139}):[31001-31001] in work directory '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377'
      I0407 15:18:22.764103 22758 slave.cpp:1987] Queued task 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' for executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
      I0407 15:18:22.766253 22764 containerizer.cpp:943] Starting container d5a56564-3e24-4c60-9919-746710b78377 for executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014
      I0407 15:18:22.767514 22766 linux.cpp:730] Mounting '/var/lib/mesos/volumes/roles/cassandra-cstar-location-store/d6290423-2ba4-4975-86f4-ffd84ad138ff' to '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume' for persistent volume disk(cassandra-cstar-location-store, cassandra, {resource_id: fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445 of container d5a56564-3e24-4c60-9919-746710b78377
      I0407 15:18:22.894340 22768 containerizer.cpp:1494] Checkpointing container's forked pid 6892 to '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/pids/forked.pid'
      I0407 15:19:01.011916 22749 slave.cpp:3231] Got registration for executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 from executor(1)@10.14.6.132:36837
      I0407 15:19:01.031939 22770 slave.cpp:2191] Sending queued task 'node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4' to executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837
      I0407 15:26:14.012861 22749 linux.cpp:627] Removing mount '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume' for persistent volume disk(cassandra-cstar-location-store, cassandra, {resource_id: fefc15d6-0c6f-4eac-a3f8-c34d0335c5ec})[d6290423-2ba4-4975-86f4-ffd84ad138ff:volume]:6466445 of container d5a56564-3e24-4c60-9919-746710b78377
      E0407 15:26:14.013828 22756 slave.cpp:3903] Failed to update resources for container d5a56564-3e24-4c60-9919-746710b78377 of executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' running task node-29__c6fdf823-e31a-4b78-a34f-e47e749c07f4 on status update for terminal task, destroying container: Collect failed: Failed to unmount unneeded persistent volume at '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume': Failed to unmount '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume': Device or resource busy
      I0407 15:26:14.545647 22747 linux.cpp:810] Unmounting volume '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume' for container d5a56564-3e24-4c60-9919-746710b78377
      E0407 15:26:14.546123 22753 slave.cpp:4520] Termination of executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 failed: Failed to clean up an isolator when destroying container: Failed to unmount volume '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume': Failed to unmount '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377/volume': Device or resource busy
      I0407 15:26:14.566028 22744 slave.cpp:4646] Cleaning up executor 'node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' of framework 5d030fd5-0fb6-4366-9dee-706261fa0749-0014 at executor(1)@10.14.6.132:36837
      I0407 15:26:14.566186 22768 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' for gc 6.99999344714074days in the future
      I0407 15:26:14.566299 22768 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' for gc 6.99999344665481days in the future
      I0407 15:26:14.566337 22768 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7/runs/d5a56564-3e24-4c60-9919-746710b78377' for gc 6.99999344637926days in the future
      I0407 15:26:14.566368 22768 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/91ec544d-ac98-4958-bd7f-85d1f7822421-S3296/frameworks/5d030fd5-0fb6-4366-9dee-706261fa0749-0014/executors/node-29_executor__7eeb4a92-4849-4de5-a2d0-90f64705f5d7' for gc 6.99999344597333days in the future
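The log above shows the risky sequence: an unmount fails with "Device or resource busy", and shortly afterwards a parent directory of that mount is scheduled for gc. To check whether another agent log contains the same pattern, a rough scanner could look like the sketch below. It assumes one log entry per line and relies only on the message wording visible in this report; the function name and the shortened sample lines are hypothetical.

```python
import re

# Patterns copied from the message wording in the agent log in this report.
UNMOUNT_FAILED = re.compile(r"Failed to unmount '([^']+)'")
GC_SCHEDULED = re.compile(r"gc\.cpp:\d+\] Scheduling '([^']+)' for gc")


def find_gc_over_busy_mounts(log_lines):
    """Return (gc_path, mount_path) pairs where a directory scheduled for gc
    contains a path whose unmount failed earlier in the log."""
    failed = set()
    hits = []
    for line in log_lines:
        # Remember every path whose unmount has failed so far.
        failed.update(UNMOUNT_FAILED.findall(line))
        match = GC_SCHEDULED.search(line)
        if match:
            gc_path = match.group(1).rstrip("/")
            for mount in sorted(failed):
                if mount == gc_path or mount.startswith(gc_path + "/"):
                    hits.append((gc_path, mount))
    return hits


# Hypothetical, shortened log lines that follow the shape of the real ones.
SAMPLE_LOG = [
    "E0407 15:26:14.546123 22753 slave.cpp:4520] Termination of executor 'E1' "
    "of framework F1 failed: Failed to unmount volume '/work/runs/R1/volume': "
    "Failed to unmount '/work/runs/R1/volume': Device or resource busy",
    "I0407 15:26:14.566186 22768 gc.cpp:55] Scheduling '/work/runs/R1' "
    "for gc 6.9days in the future",
    "I0407 15:26:14.566337 22768 gc.cpp:55] Scheduling '/meta/runs/R1' "
    "for gc 6.9days in the future",
]

if __name__ == "__main__":
    for gc_path, mount in find_gc_over_busy_mounts(SAMPLE_LOG):
        print("gc of", gc_path, "would descend into busy mount", mount)
```

Any pair it reports marks a sandbox that should be inspected, and the volume unmounted, before the scheduled gc runs.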
      

          People

            Jie Yu (jieyu)
            Zhitao Li (zhitao)
            Votes: 0
            Watchers: 7
