Uploaded image for project: 'Kudu'
  1. Kudu
  2. KUDU-3217

Limit the amount of time spent by SASL method run by WrapSaslCall

Attach filesAttach ScreenshotAdd voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • master, security, tserver
    • None

    Description

      There were reports on connection negotiation issues in a secure Kudu cluster: Kudu clients (e.g., kudu cluster ksck tool) would fail to establish a connection to a tablet server or master. The issue happened rather rarely, but in a busy cluster that was a major nuisance because the target server would not accept any new connections for a very long time, and usually the solution was to restart the server (kudu-tserver or kudu-master correspondingly).

      The stack traces collected from the diagnostic files pointed to a situation where one negotiation thread acquired the mutex in WrapSaslCall() in read mode and stuck for a long time (several minutes), while the Kerberos creds renewal thread was waiting on the lock to be acquired in write mode. Consequently, all other connection negotiation threads were blocked on the same mutex after that since the mutex is of RWMutex::Priority::PREFER_WRITING priority.

      The stacks of the related threads looked like the following:

        tids=[380992]
            0x7f060e1e45e0 <unknown>
            0x7f060c4afa3d __GI___poll
            0x7f0605813ffa <unknown>
            0x7f06058147b5 <unknown>
            0x7f0605814b07 sss_pac_make_request
            0x7f0605813dab <unknown>
            0x7f060df31deb <unknown>
            0x7f060df53661 <unknown>
            0x7f060df5407a krb5_rd_req_decoded
            0x7f060d3d190c <unknown>
            0x7f060d3d2dea <unknown>
            0x7f060d3d2f49 <unknown>
            0x7f060d3c0606 gss_accept_sec_context
            0x7f060765c04c <unknown>
            0x7f060d610b9b sasl_server_step
            0x7f060d611109 sasl_server_start
        tids=[170406,168067,167927,170163,168917,169251]
            0x7f060e1e45e0 <unknown>
            0x7f060e1e0092 __GI___pthread_rwlock_rdlock
                 0x1d09883 kudu::rpc::WrapSaslCall()
                 0x1d0bd21 kudu::rpc::ServerNegotiation::InitSaslServer()
                 0x1d112f7 kudu::rpc::ServerNegotiation::AuthenticateBySasl()
                 0x1d13096 kudu::rpc::ServerNegotiation::Negotiate()
                 0x1d2420c kudu::rpc::Negotiation::RunNegotiation()
                 0x1cfa905 kudu::internal::Invoker<>::Run()
                 0x1e9234f kudu::ThreadPool::DispatchThread()
                 0x1e88014 kudu::Thread::SuperviseThread()
            0x7f060e1dce25 start_thread
            0x7f060c4ba34d __clone
        tids=[380520]
            0x7f060e1e45e0 <unknown>
            0x7f060e1e02ac __GI___pthread_rwlock_wrlock
                 0x1d3b737 kudu::security::(anonymous namespace)::RenewThread()
                 0x1e88014 kudu::Thread::SuperviseThread()
            0x7f060e1dce25 start_thread
            0x7f060c4ba34d __clone
      

      Thread 380992 is the thread that acquired the mutex as a reader and stuck in a SASL call (the latter went through the SSSD PAC plugin). Thread 380520 is the Kerberos creds renewal thread, trying to acquire the mutex as a writer. The rest are connection negotiation threads trying to acquire the lock as readers.

      Further investigation revealed an issue in SSSD, where the stack of the stuck thread looks exactly the same as the stack of 380992 (the latter didn't have debug symbols to show information on every function in the stack):

      #0  0x00007f29342dcdfd in poll () from /lib64/libc.so.6
      #1  0x00007f2901e722ba in sss_cli_make_request_nochecks () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
      #2  0x00007f2901e72a75 in sss_cli_check_socket () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
      #3  0x00007f2901e72e07 in sss_pac_make_request () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
      #4  0x00007f2901e71feb in sssdpac_verify () from /usr/lib64/krb5/plugins/authdata/sssd_pac_plugin.so
      #5  0x00007f29364ea3d3 in krb5int_authdata_verify () from /lib64/libkrb5.so.3
      #6  0x00007f293650b621 in rd_req_decoded_opt () from /lib64/libkrb5.so.3
      #7  0x00007f293650c03a in krb5_rd_req_decoded () from /lib64/libkrb5.so.3
      #8  0x00007f292d592b3f in kg_accept_krb5 () from /lib64/libgssapi_krb5.so.2
      #9  0x00007f292d5941fa in krb5_gss_accept_sec_context_ext () from /lib64/libgssapi_krb5.so.2
      #10 0x00007f292d594359 in krb5_gss_accept_sec_context () from /lib64/libgssapi_krb5.so.2
      #11 0x00007f292d5816d6 in gss_accept_sec_context () from /lib64/libgssapi_krb5.so.2
      #12 0x00007f292d7c3edc in gssapi_server_mech_step () from /usr/lib64/sasl2/libgssapiv2.so
      #13 0x00007f29349e5b9b in sasl_server_step () from /lib64/libsasl2.so.3
      #14 0x00007f29349e6109 in sasl_server_start () from /lib64/libsasl2.so.3
      

      Given that there might be many other bugs in that path and a KDC might be slow to respond to a particular request, it would be great to limit the amount of time spent by the SASL call run by WrapSaslCall(). If it's over the limit, the code would return Status::TimedOut() or Status::ServiceUnavailable() status and the client side could handle the response appropriately, but at least Kudu masters and tablet server would be able to accept new connections and handle those new requests in a timely manner.

      Also, it doesn't seem like a very good idea to acquire a lock and issue a SASL call since the latter is often turns to be a remote call.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            aserbin Alexey Serbin

            Dates

              Created:
              Updated:

              Slack

                Issue deployment