IMPALA / IMPALA-2567 KRPC milestone 1 / IMPALA-5557

Disable rpc_default_keepalive_time_ms


Details


    Description

      Queries were still hitting the KDC hard and eventually failed. This is the sequence of events:

      1. Start running a query
      2. Impala backends send thousands of TGS_REQ requests to the KDC
      3. Query would appear to make some progress
      4. Some connections succeed; others fail with "Timeout exceeded waiting to connect"
      5. Then I noticed in the logs that idle connections get killed after 65 seconds:
        reactor.cc:281] Timing out connection server connection from 10.17.229.14:41804 - it has been idle for 65.0002s
      6. Query takes about 2 minutes and fails
      7. By then most connections are released since they were idle for > 65 seconds
      8. New queries go through the same process again and eventually fail

      In order to reliably run on a large cluster I believe we need to:

      1. Change the lifetime of idle connections, possibly extending it to the Kerberos ticket lifetime?
      2. Create new connections before existing ones expire, in a staggered fashion, to avoid KDC-related failures?
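
A rough sketch of what the staggered renewal in item 2 could look like, assuming each connection is renewed at a random point within the last part of its lifetime so that renewals (and the TGS_REQ traffic they cause) do not all land at once. The function name, the `window_fraction` parameter, and the 20% window are hypothetical and not taken from Impala or KRPC; the 65-second lifetime is only used because it matches the idle timeout seen in the log.

```python
import random


def staggered_renewal_times(num_connections, lifetime_s, window_fraction=0.2, seed=None):
    """Pick a renewal time (seconds after creation) for each connection.

    Hypothetical helper: every connection is renewed at a uniformly random
    point inside the final `window_fraction` of its lifetime, so renewals
    are spread out instead of all firing at expiry.
    """
    if not 0 < window_fraction <= 1:
        raise ValueError("window_fraction must be in (0, 1]")
    rng = random.Random(seed)
    window = lifetime_s * window_fraction   # length of the renewal window
    earliest = lifetime_s - window          # no renewal before this point
    return [earliest + rng.random() * window for _ in range(num_connections)]


if __name__ == "__main__":
    # Assumed numbers for illustration: 1000 connections, 65 s lifetime.
    times = staggered_renewal_times(1000, 65.0, seed=1)
    print(f"renewals spread between {min(times):.1f}s and {max(times):.1f}s")
```

With these assumed values the renewals fall somewhere in the last 13 seconds of the 65-second lifetime rather than at a single instant. Whether this is enough to keep the KDC healthy would depend on cluster size and would need to be measured.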

      The idle timeout is controlled by the flag rpc_default_keepalive_time_ms.

            People

              kwho Michael Ho
              mmokhtar Mostafa Mokhtar