Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-5988

PollSocketImpl can write to a stale fd.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • None
    • 1.0.1
    • libprocess
    • Mesosphere Sprint 40, Mesosphere Sprint 41
    • 3

    Description

      When tracking down MESOS-5986 with greggomann and anandmazumdar. We were curious why PollSocketImpl avoids the same issue. It seems that PollSocketImpl has a similar race, however in the case of PollSocketImpl we will simply write to a stale file descriptor.

      One example is PollSocketImpl::send(const char*, size_t):

      https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/poll_socket.cpp#L241-L245

      Future<size_t> PollSocketImpl::send(const char* data, size_t size)
      {
        return io::poll(get(), io::WRITE)
          .then(lambda::bind(&internal::socket_send_data, get(), data, size));
      }
      
      Future<size_t> socket_send_data(int s, const char* data, size_t size)
      {
        CHECK(size > 0);
      
        while (true) {
          ssize_t length = send(s, data, size, MSG_NOSIGNAL);
      
      #ifdef __WINDOWS__
          int error = WSAGetLastError();
      #else
          int error = errno;
      #endif // __WINDOWS__
      
          if (length < 0 && net::is_restartable_error(error)) {
            // Interrupted, try again now.
            continue;
          } else if (length < 0 && net::is_retryable_error(error)) {
            // Might block, try again later.
            return io::poll(s, io::WRITE)
              .then(lambda::bind(&internal::socket_send_data, s, data, size));
          } else if (length <= 0) {
            // Socket error or closed.
            if (length < 0) {
              const string error = os::strerror(errno);
              VLOG(1) << "Socket error while sending: " << error;
            } else {
              VLOG(1) << "Socket closed while sending";
            }
            if (length == 0) {
              return length;
            } else {
              return Failure(ErrnoError("Socket send failed"));
            }
          } else {
            CHECK(length > 0);
      
            return length;
          }
        }
      }
      

      If the last reference to the Socket goes away before the socket_send_data loop completes, then we will write to a stale fd!

      It turns out that we have avoided this issue because in libprocess we happen to keep a reference to the Socket around when sending:

      https://github.com/apache/mesos/blob/1.0.0/3rdparty/libprocess/src/process.cpp#L1678-L1707

      void send(Encoder* encoder, Socket socket)
      {
        switch (encoder->kind()) {
          case Encoder::DATA: {
            size_t size;
            const char* data = static_cast<DataEncoder*>(encoder)->next(&size);
            socket.send(data, size)
              .onAny(lambda::bind(
                  &internal::_send,
                  lambda::_1,
                  socket,
                  encoder,
                  size));
            break;
          }
          case Encoder::FILE: {
            off_t offset;
            size_t size;
            int fd = static_cast<FileEncoder*>(encoder)->next(&offset, &size);
            socket.sendfile(fd, offset, size)
              .onAny(lambda::bind(
                  &internal::_send,
                  lambda::_1,
                  socket,
                  encoder,
                  size));
            break;
          }
        }
      }
      

      However, this may not be true in all call-sites going forward. Currently, it appears that http::Connection can trigger this bug.

      Attachments

        Issue Links

          Activity

            People

              greggomann Greg Mann
              bmahler Benjamin Mahler
              Benjamin Mahler Benjamin Mahler
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: