There were reports of connection negotiation issues in a secure Kudu cluster: Kudu clients (e.g., the kudu cluster ksck tool) would fail to establish a connection to a tablet server or master. The issue happened rarely, but in a busy cluster it was a major nuisance because the target server would not accept any new connections for a very long time, and the usual remedy was to restart the server (kudu-tserver or kudu-master, respectively).
The stack traces collected from the diagnostic files pointed to a situation where one negotiation thread acquired the mutex in WrapSaslCall() in read mode and got stuck for a long time (several minutes), while the Kerberos credentials renewal thread was waiting to acquire the same mutex in write mode. Consequently, all other connection negotiation threads were blocked on that mutex from then on, because the mutex has RWMutex::Priority::PREFER_WRITING priority and so does not admit new readers while a writer is waiting.
The stacks of the related threads looked like the following:
0x1d3b737 kudu::security::(anonymous namespace)::RenewThread()
Thread 380992 is the thread that acquired the mutex as a reader and got stuck in a SASL call (the latter went through the SSSD PAC plugin). Thread 380520 is the Kerberos credentials renewal thread, trying to acquire the mutex as a writer. The rest are connection negotiation threads trying to acquire the lock as readers.
Further investigation revealed an issue in SSSD, where the stack of the stuck thread looks exactly the same as the stack of thread 380992 (the stack of 380992 was collected without debug symbols, so it doesn't show information on every function):
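The interaction between the three kinds of threads can be reproduced with a simplified model of a writer-preferring reader-writer lock. The sketch below is not Kudu's RWMutex; WriterPreferringRWLock and NewReaderBlockedBehindQueuedWriter() are hypothetical names introduced only for illustration, and the model assumes that a queued writer keeps all new readers out:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Simplified model of a writer-preferring RW lock (NOT Kudu's RWMutex):
// new readers are refused as soon as at least one writer is queued.
class WriterPreferringRWLock {
 public:
  void lock_shared() {
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait(l, [this] { return !writer_active_ && waiting_writers_ == 0; });
    ++active_readers_;
  }
  // Non-blocking variant, used below to observe the stall without hanging.
  bool try_lock_shared() {
    std::lock_guard<std::mutex> l(mu_);
    if (writer_active_ || waiting_writers_ > 0) return false;
    ++active_readers_;
    return true;
  }
  void unlock_shared() {
    std::lock_guard<std::mutex> l(mu_);
    assert(active_readers_ > 0);
    if (--active_readers_ == 0) cv_.notify_all();
  }
  void lock() {
    std::unique_lock<std::mutex> l(mu_);
    ++waiting_writers_;
    cv_.wait(l, [this] { return !writer_active_ && active_readers_ == 0; });
    --waiting_writers_;
    writer_active_ = true;
  }
  void unlock() {
    std::lock_guard<std::mutex> l(mu_);
    assert(writer_active_);
    writer_active_ = false;
    cv_.notify_all();
  }
  int waiting_writers() {
    std::lock_guard<std::mutex> l(mu_);
    return waiting_writers_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int active_readers_ = 0;
  int waiting_writers_ = 0;
  bool writer_active_ = false;
};

// Returns true if a new reader is refused while one reader holds the lock
// and one writer is queued behind it.
bool NewReaderBlockedBehindQueuedWriter() {
  WriterPreferringRWLock lock;
  lock.lock_shared();            // plays the stuck negotiation thread
  std::thread writer([&lock] {   // plays the credentials renewal thread
    lock.lock();
    lock.unlock();
  });
  while (lock.waiting_writers() == 0) std::this_thread::yield();
  bool acquired = lock.try_lock_shared();  // a new negotiation thread
  if (acquired) lock.unlock_shared();
  lock.unlock_shared();          // the stuck call finally returns
  writer.join();
  return !acquired;
}
```

In this model the lock is held only by a reader, yet a second reader cannot get in once the writer has queued up, which matches the picture in the collected stacks.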
#0 0x00007f29342dcdfd in poll () from /lib64/libc.so.6
#1 0x00007f2901e722ba in sss_cli_make_request_nochecks () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#2 0x00007f2901e72a75 in sss_cli_check_socket () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#3 0x00007f2901e72e07 in sss_pac_make_request () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#4 0x00007f2901e71feb in sssdpac_verify () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
#5 0x00007f29364ea3d3 in krb5int_authdata_verify () from /lib64/libkrb5.so.3
#6 0x00007f293650b621 in rd_req_decoded_opt () from /lib64/libkrb5.so.3
#7 0x00007f293650c03a in krb5_rd_req_decoded () from /lib64/libkrb5.so.3
#8 0x00007f292d592b3f in kg_accept_krb5 () from /lib64/libgssapi_krb5.so.2
#9 0x00007f292d5941fa in krb5_gss_accept_sec_context_ext () from /lib64/libgssapi_krb5.so.2
#10 0x00007f292d594359 in krb5_gss_accept_sec_context () from /lib64/libgssapi_krb5.so.2
#11 0x00007f292d5816d6 in gss_accept_sec_context () from /lib64/libgssapi_krb5.so.2
#12 0x00007f292d7c3edc in gssapi_server_mech_step () from /usr/lib64/sasl2/libgssapiv2.so
#13 0x00007f29349e5b9b in sasl_server_step () from /lib64/libsasl2.so.3
#14 0x00007f29349e6109 in sasl_server_start () from /lib64/libsasl2.so.3
Given that there might be many other bugs in that path and a KDC might be slow to respond to a particular request, it would be great to limit the amount of time spent in the SASL call run by WrapSaslCall(). If the call exceeds the limit, the code would return Status::TimedOut() or Status::ServiceUnavailable() and the client side could handle the response appropriately; at the very least, Kudu masters and tablet servers would be able to accept new connections and handle new requests in a timely manner.
Also, it doesn't seem like a very good idea to acquire a lock and then issue a SASL call while holding it, since the latter often turns out to be a remote call.
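One possible shape for such a limit is to run the call on a helper thread and wait for its result with a deadline. This is only a sketch under stated assumptions, not a proposed patch: Code, Result and CallWithTimeout() are hypothetical stand-ins (Code mimics the relevant kudu::Status codes), and the callable stands in for the SASL call:

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <thread>

// Hypothetical stand-in for the status codes of interest.
enum class Code { kOk, kTimedOut };

struct Result {
  Code code;
  int rc;  // return code of the wrapped call; meaningful only for kOk
};

// Runs 'call' on a detached helper thread and waits for it for at most
// 'timeout'. The call itself is not cancelled on timeout: the helper thread
// stays blocked until the call returns, but the caller is released.
Result CallWithTimeout(std::function<int()> call,
                       std::chrono::milliseconds timeout) {
  assert(call != nullptr);
  // The promise is shared so it outlives this function if the call is late.
  auto promise = std::make_shared<std::promise<int>>();
  std::future<int> result = promise->get_future();
  std::thread([promise, call] { promise->set_value(call()); }).detach();
  if (result.wait_for(timeout) != std::future_status::ready) {
    return Result{Code::kTimedOut, 0};
  }
  return Result{Code::kOk, result.get()};
}
```

A deadline like this bounds only how long the caller waits. If the helper thread still holds the read lock while it is blocked, the renewal thread stays blocked too, so the place where the lock is taken matters as much as the timeout itself.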