MESOS-5629

Agent segfaults after request to '/files/browse'


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Environment: CentOS 7, Mesos 1.0.0-rc1 with patches
    • Sprint: Mesosphere Sprint 37
    • Story Points: 3

    Description

      We observed a number of agent segfaults today on an internal testing cluster. Here is a log excerpt:

      Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.522925 24830 status_update_manager.cpp:392] Received status update acknowledgement (UUID: e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework 6d4248cd-2832-4152-b5d0-defbf36f6759-0000
      Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.523006 24830 status_update_manager.cpp:824] Checkpointing ACK for status update TASK_RUNNING (UUID: e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework 6d4248cd-2832-4152-b5d0-defbf36f6759-0000
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:29.147181 24824 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.87:33356
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** Aborted at 1466097149 (unix time) try "date -d @1466097149" if you are using GNU date ***
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: PC: @     0x7ff4d68b12a3 (unknown)
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** SIGSEGV (@0x0) received by PID 24818 (TID 0x7ff4d31ab700) from PID 0; stack trace: ***
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @     0x7ff4d6431100 (unknown)
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @     0x7ff4d68b12a3 (unknown)
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @     0x7ff4d7eced33 process::dispatch<>()
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @     0x7ff4d7e7aad7 _ZNSt17_Function_handlerIFN7process6FutureIbEERK6OptionISsEEZN5mesos8internal5slave9Framework15recoverExecutorERKNSA_5state13ExecutorStateEEUlS6_E_E9_M_invokeERKSt9_Any_dataS6_
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @     0x7ff4d7bd1752 mesos::internal::FilesProcess::authorize()
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @     0x7ff4d7bd1bea mesos::internal::FilesProcess::browse()
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @     0x7ff4d7bd6e43 std::_Function_handler<>::_M_invoke()
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @     0x7ff4d85478cb _ZZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultEEEEE0_clESC_ENKUlRKNS4_IbEEE1_clESG_
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @     0x7ff4d8551341 process::ProcessManager::resume()
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @     0x7ff4d8551647 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @     0x7ff4d6909220 (unknown)
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @     0x7ff4d6429dc5 start_thread
      Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @     0x7ff4d615728d __clone
      Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service: main process exited, code=killed, status=11/SEGV
      Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: Unit dcos-mesos-slave.service entered failed state.
      Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service failed.
      Jun 16 17:12:34 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service holdoff time over, scheduling restart.
      

      In every case, the stack trace indicates one of the /files/* endpoints; I observed this a number of times coming from browse(), and twice from read().
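
      For context, a browse request against the agent takes the form GET /files/browse?path=<path> (and a read takes GET /files/read?path=<path>&offset=<n>&length=<n>). Below is a minimal sketch of exercising the browse endpoint with libprocess's own HTTP client; the address, port, and sandbox path are placeholders rather than values from this cluster, and the attached test-browse.py presumably drives the endpoint in a similar way.

      #include <iostream>

      #include <process/future.hpp>
      #include <process/http.hpp>

      #include <stout/duration.hpp>

      namespace http = process::http;

      int main()
      {
        // Illustrative only: ask a local agent to list a directory it serves.
        // The address, port, and path are placeholders.
        http::URL url("http", "127.0.0.1", 5051, "/files/browse");
        url.query["path"] = "/var/lib/mesos/slave";

        process::Future<http::Response> response = http::get(url);

        // Block briefly for the response; a real client would chain a
        // continuation on the future instead.
        if (response.await(Seconds(10)) && response.isReady()) {
          std::cout << response->status << std::endl
                    << response->body << std::endl;
        }

        return 0;
      }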

      The agent was built from the 1.0.0-rc1 branch with two cherry-picks applied (this and this), which repair a different segfault issue on the master and agent.

      Thanks to bmahler for digging into this and identifying a possible cause here: the continuation may need to be wrapped in defer() to keep execution in the correct actor context.
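
      To illustrate the pattern in question, here is a minimal sketch (not the actual Mesos fix) of a libprocess actor that chains a continuation on an authorization future. The class and helper names are hypothetical; only Process<>, then(), and defer() are real libprocess APIs.

      #include <string>

      #include <process/defer.hpp>
      #include <process/future.hpp>
      #include <process/process.hpp>

      using process::Future;

      // Hypothetical actor standing in for FilesProcess.
      class ExampleProcess : public process::Process<ExampleProcess>
      {
      public:
        Future<std::string> browse()
        {
          // UNSAFE: without defer(), the continuation runs in whatever
          // execution context happens to satisfy the future, and `this` may
          // point at a process that has already terminated. That kind of
          // dangling access produces a SIGSEGV like the one above:
          //
          //   return authorize().then(
          //       [this](bool authorized) { return listing(authorized); });

          // SAFE: defer(self(), ...) wraps the lambda in a dispatch back to
          // this process, so it runs serially inside the actor and is simply
          // dropped, rather than run on freed state, if the actor is gone.
          return authorize().then(
              process::defer(self(), [this](bool authorized) {
                return listing(authorized);
              }));
        }

      private:
        Future<bool> authorize() { return true; } // Stub authorizer.

        std::string listing(bool authorized)
        {
          return authorized ? "[]" : "Forbidden";
        }
      };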

      Attachments

        1. test-browse.py (1 kB, attached by Greg Mann)


          People

            Assignee: Jörg Schad (js84)
            Reporter: Greg Mann (greggomann)
            Votes: 0
            Watchers: 7
