Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Impala 3.2.0
Description
On a recent S3 test run, test_shutdown_executor timed out while waiting for a query to reach state 4 (FINISHED). Instead, the query ended up in state 5 (EXCEPTION).
12:51:11 __________________ TestShutdownCommand.test_shutdown_executor __________________
12:51:11 custom_cluster/test_restart_services.py:209: in test_shutdown_executor
12:51:11     assert self.__fetch_and_get_num_backends(QUERY, before_shutdown_handle) == 3
12:51:11 custom_cluster/test_restart_services.py:356: in __fetch_and_get_num_backends
12:51:11     self.client.QUERY_STATES['FINISHED'], timeout=20)
12:51:11 common/impala_service.py:267: in wait_for_query_state
12:51:11     target_state, query_state)
12:51:11 E   AssertionError: Did not reach query state in time target=4 actual=5
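To illustrate how a wait of this kind ends in "target=4 actual=5", here is a minimal sketch of a polling helper. It is not the code in common/impala_service.py: the names wait_for_state, get_state and failed_query_state are hypothetical, and only the state numbers (4 = FINISHED, 5 = EXCEPTION) and the message text are taken from the trace above.

```python
import time

# State numbers taken from the failure message (target=4, actual=5).
# Everything else is a hypothetical stand-in, not Impala's test code.
FINISHED = 4
EXCEPTION = 5


def wait_for_state(get_state, target_state, timeout, interval=0.01):
    """Poll get_state() until it returns target_state or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    actual = get_state()
    while actual != target_state and time.monotonic() < deadline:
        time.sleep(interval)
        actual = get_state()
    # A query stuck in EXCEPTION never matches FINISHED, so the loop only
    # ends once the deadline passes and the assertion below fires.
    assert actual == target_state, (
        "Did not reach query state in time target={0} actual={1}".format(
            target_state, actual))
    return actual


def failed_query_state():
    # Hypothetical query whose executor became unreachable.
    return EXCEPTION
```

In this sketch, calling wait_for_state(failed_query_state, FINISHED, timeout=20) would poll for the full 20 seconds before raising, because the loop does not treat EXCEPTION as a terminal state. That matches the symptom reported here: a timeout rather than an immediate failure.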
From the logs I can see that the query fails because one of the executors becomes unreachable:
I1204 12:31:39.954125 5609 impala-server.cc:1792] Query a34c3a84775e5599:b2b25eb900000000: Failed due to unreachable impalad(s): jenkins-worker:22001
The query was select count(*) from functional_parquet.alltypes where sleep(1) = bool_col.
It seems that the query took longer than expected and was still running when the executor shut down.
I can reproduce by adding a sleep to the test:
diff --git a/tests/custom_cluster/test_restart_services.py b/tests/custom_cluster/test_restart_services.py
index e441cbc..32bc8a1 100644
--- a/tests/custom_cluster/test_restart_services.py
+++ b/tests/custom_cluster/test_restart_services.py
@@ -206,7 +206,7 @@ class TestShutdownCommand(CustomClusterTestSuite, HS2TestSuite):
     after_shutdown_handle = self.__exec_and_wait_until_running(QUERY)
 
     # Finish executing the first query before the backend exits.
-    assert self.__fetch_and_get_num_backends(QUERY, before_shutdown_handle) == 3
+    assert self.__fetch_and_get_num_backends(QUERY, before_shutdown_handle, delay=5) == 3
 
     # Wait for the impalad to exit, then start it back up and run another query, which
     # should be scheduled on it again.
@@ -349,11 +349,14 @@ class TestShutdownCommand(CustomClusterTestSuite, HS2TestSuite):
         self.client.QUERY_STATES['RUNNING'], timeout=20)
     return handle
 
-  def __fetch_and_get_num_backends(self, query, handle):
+  def __fetch_and_get_num_backends(self, query, handle, delay=0):
     """Fetch the results of 'query' from the beeswax handle 'handle', close the
     query and return the number of backends obtained from the profile."""
     self.impalad_test_service.wait_for_query_state(self.client, handle,
         self.client.QUERY_STATES['FINISHED'], timeout=20)
+    if delay > 0:
+      LOG.info("sleeping for {0}".format(delay))
+      time.sleep(delay)
     self.client.fetch(query, handle)
     profile = self.client.get_runtime_profile(handle)
     self.client.close_query(handle)