      Description

      Queries were still hitting the KDC hard and eventually failed. This is the sequence of events:

      1. Start running a query
      2. Impala backends send thousands of TGS_REQ requests to the KDC
      3. Query appears to make some progress
      4. Some connections succeed; others fail with "Timeout exceeded waiting to connect"
      5. Then I noticed in the logs that idle connections get killed after 65 seconds:
        reactor.cc:281] Timing out connection server connection from 10.17.229.14:41804 - it has been idle for 65.0002s
      6. Query takes about 2 minutes and then fails
      7. By then most connections have been released, since they were idle for > 65 seconds
      8. New queries go through the same process again and eventually fail

      In order to reliably run on a large cluster I believe we need to:

      1. Change the lifetime of idle connections, possibly extend it to ticket lifetime?
      2. Create new connections before existing ones expire in a staggered fashion to avoid KDC related failures?
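
A rough sketch of what the staggering in item 2 could look like. This is not Impala or KRPC code: the function name `staggered_refresh_times`, the `window_fraction` parameter, and the idea of giving each connection its own time slot are assumptions made for illustration. Only the 65 second idle timeout comes from the log line above.

```python
import random


def staggered_refresh_times(num_connections, idle_timeout_s,
                            window_fraction=0.5, seed=None):
    """Return one refresh time per connection, in seconds after creation.

    Hypothetical helper: refreshes are spread over the last
    `window_fraction` of the idle timeout, one connection per slot with
    random jitter inside the slot, so that reconnects (and the Kerberos
    requests they trigger) do not all reach the KDC at the same instant.
    """
    if num_connections <= 0:
        raise ValueError("num_connections must be positive")
    if not 0.0 < window_fraction <= 1.0:
        raise ValueError("window_fraction must be in (0, 1]")

    rng = random.Random(seed)
    window_start = idle_timeout_s * (1.0 - window_fraction)
    slot = (idle_timeout_s - window_start) / num_connections

    # Connection i is refreshed somewhere inside slot i of the window.
    return [window_start + (i + rng.random()) * slot
            for i in range(num_connections)]


if __name__ == "__main__":
    # 65 s is the idle timeout seen in the log; 8 connections is arbitrary.
    for i, t in enumerate(staggered_refresh_times(8, 65.0, seed=1)):
        print(f"connection {i}: refresh at {t:.1f}s")
```

With these assumed defaults every refresh lands in the second half of the timeout window, and no two connections share a slot, so the KDC sees a steady trickle of requests rather than a burst.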

      The idle connection timeout is controlled by the flag rpc_default_keepalive_time_ms.
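
To make item 1 of the proposal concrete, here is a minimal sketch that derives a value for this flag from a Kerberos ticket lifetime. The helper name `keepalive_flag`, the 10 hour ticket lifetime, and the 5 minute safety margin are all assumptions; whether tying the keepalive to the ticket lifetime is the right fix is exactly the open question raised above.

```python
def keepalive_flag(ticket_lifetime_s, safety_margin_s=300):
    """Build a command-line setting for rpc_default_keepalive_time_ms.

    Hypothetical helper: the keepalive is set slightly shorter than the
    ticket lifetime so idle connections are dropped before the ticket
    they were authenticated with expires.
    """
    if safety_margin_s >= ticket_lifetime_s:
        raise ValueError("safety margin must be shorter than ticket lifetime")
    keepalive_ms = int((ticket_lifetime_s - safety_margin_s) * 1000)
    return f"--rpc_default_keepalive_time_ms={keepalive_ms}"


if __name__ == "__main__":
    # Assumed 10 hour ticket lifetime; use the value configured on your KDC.
    print(keepalive_flag(10 * 60 * 60))
```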

              People

              • Assignee:
                kwho Michael Ho
                Reporter:
                mmokhtar Mostafa Mokhtar