Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-10047

Performance regression on short queries due to IMPALA-6984 fix

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • Impala 4.0.0
    • Impala 4.0.0
    • Backend
    • None

    Description

      When doing some TPC-DS benchmarking with mt_dop, we encountered intermittent performance regressions on short queries. Some query executions seem to be taking an extra 10 seconds in exec status reports due to delays in sending a cancel RPC. From the coordinator logs:

       

      W0804 02:52:33.922088   108 rpcz_store.cc:253] Call impala.ControlService.CancelQueryFInstances from 127.0.0.1:46738 (request call id 3134) took 10007 ms (10 s). Client timeout 10000 ms (10 s)
      W0804 02:52:33.922143   108 rpcz_store.cc:259] Trace:
      0804 02:52:23.914291 (+     0us) impala-service-pool.cc:170] Inserting onto call queue
      0804 02:52:33.922079 (+10007788us) impala-service-pool.cc:255] Skipping call since client already timed out
      0804 02:52:33.922086 (+     7us) inbound_call.cc:162] Queueing failure response
      Metrics: {}
      I0804 02:52:33.922214   101 connection.cc:730] Got response to call id 3134 after client already timed out or cancelled
      I0804 02:52:33.923286 20276 coordinator-backend-state.cc:889] query_id=f442e73a0d35c136:c9993d7700000000 target backend=xx.xx.xx.xx:27000: Sending CancelQueryFInstances rpc

      The rpcz page also shows that some ReportExecStatus RPCs are taking 10 seconds:

       

       

      "incoming_queue_time": "Count: 671901, min / max: 1000.000ns / 10s347ms, 25th %-ile: 12.000us, 50th %-ile: 18.000us, 75th %-ile: 28.000us, 90th %-ile: 67.000us, 95th %-ile: 456.000us, 99.9th %-ile: 10s133ms",
      {
      		"method_name": "ReportExecStatus",
      		"handler_latency": "Count: 169653, min / max: 38.000us / 10s173ms, 25th %-ile: 9.024ms, 50th %-ile: 20.352ms, 75th %-ile: 35.840ms, 90th %-ile: 94.720ms, 95th %-ile: 177.152ms, 99.9th %-ile: 10s027ms",
      		"payload_size": "Count: 169653, min / max: 5.81 KB / 3.81 MB, 25th %-ile: 425.00 KB, 50th %-ile: 760.00 KB, 75th %-ile: 1.47 MB, 90th %-ile: 1.96 MB, 95th %-ile: 2.31 MB, 99.9th %-ile: 3.73 MB"
      	}]

       

      IMPALA-6984 introduced a Coordinator::CancelBackends() call to Coordinator::HandleExecStateTransition() for the ExecState::RETURNED_RESULTS case:

        if (new_state == ExecState::RETURNED_RESULTS) {
          // Cancel all backends, but wait for the final status reports to be received so that
          // we have a complete profile for this successful query.
          CancelBackends(/*fire_and_forget=*/ false);
          WaitForBackends();
        } else {
          CancelBackends(/*fire_and_forget=*/ true);
        }

      Removing this call eliminates the performance regression, so it will need more investigation.

       

      Attachments

        Activity

          People

            joemcdonnell Joe McDonnell
            joemcdonnell Joe McDonnell
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: