I observed that under load the check in https://gerrit.cloudera.org/#/c/12530/8/tests/query_test/test_cancellation.py@229 would sometimes fail. After some poking it appears that the client's cancellation request does not update the operation state (because "cause", the third parameter to Cancelnternal, defaults to nullptr). Then the client disconnects and the connection close triggers another call to CancelInternal, this time with cause = "Session closed". This is also what shows up in the profile. If the server is under load, handling the closing connection can take long enough so that a concurrent request for the state still returns RUNNING. The aim of this fix was to make CancelOperation synchronous, i.e. when it returns the subsequent call to GetOperationStatus would not return RUNNING.
The comment on ClientRequestState::Cancel() suggests that this is omitted when handling client cancellation because we expect that to happen regularly for finished queries. In that case however, eos_ would be set and we wouldn't update the state either, no?
/// Cancels the child queries and the coordinator with the given cause.
/// If cause is NULL, it assume this was deliberately cancelled by the user while in
/// FINISHED operation state. Otherwise, sets state to ERROR_STATE (TODO: IMPALA-1262:
/// use CANCELED_STATE). Does nothing if the query has reached EOS or already cancelled.