Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Description
In QueryDriver::TryQueryRetry, we use fetched_rows to check whether the client has fetched any results. If it has, the retry is skipped. Code snippet:
lock_guard<mutex> l(*client_request_state->lock());
// Queries can only be retried if no rows for the query have been fetched
// (IMPALA-9225).
if (client_request_state->fetched_rows()) {
  string err_msg = Substitute("Skipping retry of query_id=$0 because the client has "
      "already fetched some rows", PrintId(query_id));
  VLOG_QUERY << err_msg;
  error->AddDetail(err_msg);
  return;
}
However, fetched_rows can be true even though the client has not actually received any rows, because the fetch timed out. For example, the following query takes more than 10s to materialize the first row batch after it enters the FINISHED state:
select * from functional.alltypes where bool_col = sleep(600)
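The mismatch can be modelled with a minimal sketch. It assumes the flag is set when a fetch is attempted, before any rows are returned, which is consistent with the behaviour described above but is not copied from the Impala source. FetchStateSketch and its methods are hypothetical stand-ins for ClientRequestState.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for ClientRequestState; not the real Impala class.
class FetchStateSketch {
 public:
  // Models one fetch call. The flag is set up front (assumption), then the
  // counter grows by however many rows were ready before the fetch timed out.
  int64_t Fetch(int64_t rows_ready_before_timeout) {
    {
      std::lock_guard<std::mutex> l(lock_);
      fetched_rows_ = true;  // set even if no row ends up being returned
    }
    std::lock_guard<std::mutex> fl(fetch_rows_lock_);
    num_rows_fetched_ += rows_ready_before_timeout;
    return rows_ready_before_timeout;
  }

  bool fetched_rows() {
    std::lock_guard<std::mutex> l(lock_);
    return fetched_rows_;
  }

  int64_t num_rows_fetched() {
    std::lock_guard<std::mutex> fl(fetch_rows_lock_);
    return num_rows_fetched_;
  }

 private:
  std::mutex lock_;              // guards fetched_rows_
  std::mutex fetch_rows_lock_;   // guards num_rows_fetched_
  bool fetched_rows_ = false;
  int64_t num_rows_fetched_ = 0;
};
```

In this model, a fetch that times out with zero rows ready leaves fetched_rows() true while num_rows_fetched() is still 0, so a check on the flag alone would skip a retry that is still safe.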
stakiar points out that fetched_rows is protected by ClientRequestState::lock_ (held in TryQueryRetry), while num_rows_fetched_ is protected only by ClientRequestState::fetch_rows_lock_, which is not held in TryQueryRetry. We need to sort out the locking logic before the check can switch to num_rows_fetched.
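One possible direction, sketched under assumptions, is to take both locks before reading the row count in the retry check. RetryStateSketch and CanRetry are hypothetical names, and std::lock is used here only to acquire the two mutexes without deadlock. The real fix would need to follow whatever lock ordering ClientRequestState already requires.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical model of the two locks and the state they guard.
struct RetryStateSketch {
  std::mutex lock;               // stands in for ClientRequestState::lock_
  std::mutex fetch_rows_lock;    // stands in for fetch_rows_lock_
  bool fetched_rows = false;     // guarded by lock
  int64_t num_rows_fetched = 0;  // guarded by fetch_rows_lock
};

// Returns true if a retry is still allowed, i.e. no rows reached the client.
bool CanRetry(RetryStateSketch& state) {
  // std::lock acquires both mutexes with a deadlock-avoidance algorithm;
  // the guards below adopt ownership so both are released on return.
  std::lock(state.lock, state.fetch_rows_lock);
  std::lock_guard<std::mutex> l(state.lock, std::adopt_lock);
  std::lock_guard<std::mutex> fl(state.fetch_rows_lock, std::adopt_lock);
  return state.num_rows_fetched == 0;
}
```

With this check, a state where a fetch was attempted but returned nothing still permits a retry, and the count is read under the lock that guards it.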