Uploaded image for project: 'Mesos'
  1. Mesos
  2. MESOS-9867

Libevent fd cleanup failure may cause hangs in combination with client certificate validation

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Accepted
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • 3

    Description

      A listening LibeventSSLSocket will check cryptographic certificate validity during the OpenSSL handshake and afterwards call the `openssl::verify()` function to perform hostname validation and other checks on the client certificate. If these checks fail, the bufferevent is deleted and the connection closed:

      // libevent_ssl_socket.cpp, accept_SSL_callback()
                if (verify.isError()) {
                  VLOG(1) << "Failed accept, verification error: " << verify.error();
                  request->promise.fail(verify.error());
                  SSL_free(ssl);
                  bufferevent_free(bev);
                  // TODO(jmlvanre): Clean up for readability. Consider RAII
                  // or constructing the impl earlier.
                  CHECK(request->socket >= 0);
                  Try<Nothing> close = os::close(request->socket);
                  if (close.isError()) {
                    LOG(FATAL)
                      << "Failed to close socket " << stringify(request->socket)
                      << ": " << close.error();
                  }
                  delete request;
                  return;
                }
      

      However, when we close the socket fd in the above code, libevent had already registered that file descriptor with epoll() to watch for read and write events on that socket. Since the socket is closed, attempts to remove the corresponding fd from the epoll() structs will fail: (See also: https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/)

      [warn] Epoll MOD(4) on fd 9 failed.  Old events were 6; read change was 2 (del); write change was 0 (none): Bad file descriptor
      [warn] Epoll MOD(1) on fd 9 failed.  Old events were 6; read change was 0 (none); write change was 2 (del): Bad file descriptor
      

      However, that in itself is harmless since the kernel will remove the kernel object that was associated with fd 9 from the data structure associated with that epoll instance in the kernel. So while we get an error attempting to remove fd 9, there is actually nothing left to remove. However, in a case of epoll failure, libprocess does not adjust the number of readers and writers on that file descriptor:

              // evmap.c, evmap_io_del()
              [...]
              if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
                     return (-1);
      
              [...]
              ctx->nread = nread;
              ctx->nwrite = nwrite;
      

      In the above, ctx is part of an array collecting information for each file descriptor. That still wouldn't be so bad, however libevent also only adds file descriptors to `epoll()` struct the first time we attempt to create a read or write event on that file descriptor:

              // evmap.c, evmap_io_add()
              if (ev->ev_events & EV_READ) {
                      if (++nread == 1)
                              res |= EV_READ;
              }
              if (ev->ev_events & EV_WRITE) {
                      if (++nwrite == 1)
                              res |= EV_WRITE;
              }
      
              [...]
      
              if (res) {
                      [...]
                      if (evsel->add(base, ev->ev_fd,
                              old, (ev->ev_events & EV_ET) | res, extra) == -1)
                              return (-1);
                      [...]
              }
      

      So when the same file descriptor is attempted to be used again by libevent for epoll() polling, the process will hang because reads or writes to that file descriptor are never noticed.

      This can be reproduced for example by running a test where the `verify()`-callback fails on the server side twice in a row: (note, the LIBPROCESS_IP below is set in order to induce a test failure, result may vary on your local network and ssl configuration)

      LIBPROCESS_IP=127.0.1.1 ./libprocess-tests --gtest_filter="*VerifyCertificate*" --gtest_repeat=2
      

      There is a chance that the issue described here is the same as the ominous "issues" described in MESOS-3008,

      Attachments

        Activity

          People

            Unassigned Unassigned
            bennoe Benno Evers
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: