Affects Version/s: Impala 2.11.0
Fix Version/s: None
Environment: I have managed to investigate this issue only once so far; the test was hanging in some of our Jenkins build jobs.
The test does the following:
1) Runs a query in Impala shell that quickly reaches the fetching state, where the fetching takes several minutes.
2) While the query is running, the script polls the Impala debug page to wait until the query gets to "FINISHED" state. This state means that the results are ready for fetching. (The polling is limited to 15 tries.)
3) Once the query gets to "FINISHED" state, a CTRL-C signal is sent to Impala shell to cancel the query.
4) Query output is fetched and verified.
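Steps 2) and 3) can be sketched roughly as follows. This is only an illustration, not the actual test code: `wait_for_query_state`, `cancel_shell`, the `get_state` callable (standing in for a lookup on the Impala debug page) and the `delay_s` parameter are all hypothetical names.

```python
import signal
import time


def wait_for_query_state(get_state, target="FINISHED", max_tries=15, delay_s=1.0):
    """Poll get_state() until it returns target, giving up after max_tries.

    get_state is a hypothetical callable that stands in for reading the
    query state from the Impala debug page. Returns the number of tries used.
    """
    for attempt in range(1, max_tries + 1):
        if get_state() == target:
            return attempt
        time.sleep(delay_s)
    raise RuntimeError(
        "query did not reach %s after %d tries" % (target, max_tries))


def cancel_shell(shell_process):
    """Send the equivalent of CTRL-C to a subprocess.Popen object (POSIX)."""
    shell_process.send_signal(signal.SIGINT)
```

Because the loop is bounded by `max_tries`, a hang in step 2) would surface as an exception rather than as a job that never finishes, which is consistent with the observations below.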
My initial assumption was that the query was somehow stuck in step 2), waiting for the desired query state (with the retry threshold somehow not being applied). However, when I checked the Impala debug page, the query had apparently moved from in-flight to completed with 2048 rows already fetched (see attached picture). The Impala logs also show that the query had been cancelled.
This means that step 2) and even step 3) had finished properly and the query was cancelled during the fetching phase.
The interesting part is that when I checked the running processes on the host, I saw an impala-shell.py process that was still executing the query.
I attached gdb to the running process, but the backtrace didn't show anything meaningful.
- The query shows as completed on the Impala debug page, with a few rows already fetched (as desired).
- Impala logs show that the query had been cancelled (as desired).
- An impala_shell.py process that seems to be running the query still shows up in 'ps -ef'.
- According to 'top', no process spikes in CPU usage.
Since the debug page shows the query as completed, I assume that both the 'waiting for state' step and the actual cancellation of the query finished successfully, so the execution must be hanging in step 4), where the results are retrieved from ImpalaShell.
get_result() contains a shell_process.communicate() call that fetches the stdout and stderr from the underlying process. According to the Python docs, communicate() is not recommended when the data size is large:
"Note The data read is buffered in memory, so do not use this method if the data size is large or unlimited."
Since this query fetches and prints results for more than 30 minutes, the stdout of ImpalaShell can be considered large.
If this is indeed the root cause, a possible solution is to modify util.py:ImpalaShell so that an input parameter decides, when calling Popen, whether stdout is connected to a pipe or not connected at all. This would suit this test, since stdout is not used at all and only stderr is asserted on, so there is no need to retrieve the stdout data from ImpalaShell.
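A minimal sketch of that idea, under stated assumptions: `ShellRunner` is a hypothetical stand-in for util.py:ImpalaShell (whose real constructor and return values are not shown in this report), `capture_stdout` is an invented parameter name, and `subprocess.DEVNULL` requires Python 3.3 or later. Whether this removes the hang has not been verified.

```python
import subprocess
import sys


class ShellRunner(object):
    """Hypothetical stand-in for util.py:ImpalaShell, not the real class."""

    def __init__(self, cmd, capture_stdout=True):
        # With capture_stdout=False the child's stdout is discarded, so
        # communicate() has nothing to buffer for that stream.
        stdout = subprocess.PIPE if capture_stdout else subprocess.DEVNULL
        self.shell_process = subprocess.Popen(
            cmd, stdout=stdout, stderr=subprocess.PIPE)

    def get_result(self):
        # communicate() blocks until the child exits and returns everything
        # read from the connected pipes as in-memory bytes objects.
        out, err = self.shell_process.communicate()
        return self.shell_process.returncode, out, err


# Stand-in child process: large stdout, short message on stderr.
CHILD_CMD = [sys.executable, "-c",
             "import sys; sys.stdout.write('x' * 1000000); "
             "sys.stderr.write('Cancelling Query')"]

rc, out, err = ShellRunner(CHILD_CMD, capture_stdout=False).get_result()
```

With `capture_stdout=False`, `out` is `None` while `err` still carries the text the test asserts on; with the default, the whole stdout of the child ends up in memory.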