Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-10047

Performance regression on short queries due to IMPALA-6984 fix

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • Impala 4.0.0
    • Impala 4.0.0
    • Backend
    • None

    Description

      When doing some TPC-DS benchmarking with mt_dop, we encountered intermittent performance regressions on short queries. Some query executions seem to be taking an extra 10 seconds in exec status reports due to delays in sending a cancel RPC. From the coordinator logs:

       

      W0804 02:52:33.922088   108 rpcz_store.cc:253] Call impala.ControlService.CancelQueryFInstances from 127.0.0.1:46738 (request call id 3134) took 10007 ms (10 s). Client timeout 10000 ms (10 s)
      W0804 02:52:33.922143   108 rpcz_store.cc:259] Trace:
      0804 02:52:23.914291 (+     0us) impala-service-pool.cc:170] Inserting onto call queue
      0804 02:52:33.922079 (+10007788us) impala-service-pool.cc:255] Skipping call since client already timed out
      0804 02:52:33.922086 (+     7us) inbound_call.cc:162] Queueing failure response
      Metrics: {}
      I0804 02:52:33.922214   101 connection.cc:730] Got response to call id 3134 after client already timed out or cancelled
      I0804 02:52:33.923286 20276 coordinator-backend-state.cc:889] query_id=f442e73a0d35c136:c9993d7700000000 target backend=xx.xx.xx.xx:27000: Sending CancelQueryFInstances rpc

      The rpcz page also shows that some ReportExecStatus RPCs are taking 10 seconds:

       

       

      "incoming_queue_time": "Count: 671901, min / max: 1000.000ns / 10s347ms, 25th %-ile: 12.000us, 50th %-ile: 18.000us, 75th %-ile: 28.000us, 90th %-ile: 67.000us, 95th %-ile: 456.000us, 99.9th %-ile: 10s133ms",
      {
      		"method_name": "ReportExecStatus",
      		"handler_latency": "Count: 169653, min / max: 38.000us / 10s173ms, 25th %-ile: 9.024ms, 50th %-ile: 20.352ms, 75th %-ile: 35.840ms, 90th %-ile: 94.720ms, 95th %-ile: 177.152ms, 99.9th %-ile: 10s027ms",
      		"payload_size": "Count: 169653, min / max: 5.81 KB / 3.81 MB, 25th %-ile: 425.00 KB, 50th %-ile: 760.00 KB, 75th %-ile: 1.47 MB, 90th %-ile: 1.96 MB, 95th %-ile: 2.31 MB, 99.9th %-ile: 3.73 MB"
      	}]

       

      IMPALA-6984 introduced a Coordinator::CancelBackends() call to Coordinator::HandleExecStateTransition() for the ExecState::RETURNED_RESULTS case:

        if (new_state == ExecState::RETURNED_RESULTS) {
          // Cancel all backends, but wait for the final status reports to be received so that
          // we have a complete profile for this successful query.
          CancelBackends(/*fire_and_forget=*/ false);
          WaitForBackends();
        } else {
          CancelBackends(/*fire_and_forget=*/ true);
        }

      Removing this call eliminates the performance regression, so it will need more investigation.

       

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            joemcdonnell Joe McDonnell
            joemcdonnell Joe McDonnell
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment