IMPALA-6318

Test suite may hang on test_query_cancellation_during_fetch


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Impala 2.11.0
    • Not Applicable
    • None
    • I have managed to investigate this issue only once so far; it was hanging in some of our Jenkins build jobs.

    Description

      test_query_cancellation_during_fetch steps:
      1) Runs a query in Impala shell that quickly reaches the fetching state, where fetching would take several minutes.
      2) While the query is running, the script polls the Impala debug page until the query reaches the "FINISHED" state. This state means that the results are ready for fetching. (The polling is limited to 15 tries.)
      3) Once the query reaches the "FINISHED" state, a CTRL-C signal (SIGINT) is sent to Impala shell to cancel the query.
      4) The query output is fetched and verified.
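
The bounded polling in step 2) can be sketched roughly as follows. This is only an illustration under assumptions: wait_for_state, get_state and the sleep parameter are hypothetical names, and this is not the actual wait_for_query_state implementation from the test suite. Only the 15-try limit and the "FINISHED" target state come from the description above.

```python
import time


def wait_for_state(get_state, target_state, max_tries=15, interval_s=1.0,
                   sleep=time.sleep):
    """Poll get_state() until it returns target_state.

    get_state is a hypothetical callable standing in for "read the query
    state from the Impala debug page". Returns the number of polls used,
    or raises RuntimeError once max_tries polls have been made.
    """
    for attempt in range(1, max_tries + 1):
        if get_state() == target_state:
            return attempt
        # Wait before the next poll so the loop does not spin.
        sleep(interval_s)
    raise RuntimeError(
        "state %r not reached after %d tries" % (target_state, max_tries))
```

Because the loop is bounded, a query that never reaches the target state makes the helper raise instead of hanging, which is why a hang in step 2) would be surprising.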

      Initial assumption
      =============
      My initial assumption was that the query was stuck in step 2), waiting for the desired query state (with the retry threshold somehow not being applied). However, when I checked the Impala debug page, the query had moved from in-flight to completed with 2048 rows already fetched (see the attached picture). The Impala logs also show that the query was cancelled.

      I1209 08:29:35.281550 18194 coordinator.cc:99] Exec() query_id=d248bc6079f33f66:1b638a700000000 stmt=with v as (values (1 as x), (2), (3), (4)) select * from v, v v2, v v3, v v4, v v5, v v6, v v7, v v8, v v9, v v10, v v11
      
      I1209 08:29:35.895359 18196 query-state.cc:384] Instance completed. instance_id=d248bc6079f33f66:1b638a700000000 #in-flight=0 status=CANCELLED: Cancelled
      I1209 08:29:35.895372 18196 query-state.cc:396] Cancel: query_id=d248bc6079f33f66:1b638a700000000
      I1209 08:29:35.895407 18196 query-exec-mgr.cc:149] ReleaseQueryState(): query_id=d248bc6079f33f66:1b638a700000000 refcnt=2
      I1209 08:29:35.908305 18194 query-exec-mgr.cc:149] ReleaseQueryState(): query_id=d248bc6079f33f66:1b638a700000000 refcnt=1
      

      This means that step 2) and even step 3) finished properly, and the query was cancelled during the fetching phase.

      The interesting part is that when I checked the running processes on the host, I found an impala_shell.py process still running and executing the query.

      jenkins  18187  6223  0 Dec09 ?        00:00:00 <path_to_impala>/Impala/shell/impala_shell.py -i localhost:21000 -q with v as (values (1 as x), (2), (3), (4)) select * from v, v v2, v v3, v v4, v v5, v v6, v v7, v v8, v v9, v v10, v v11;
      

      I attached gdb to the running process, but the backtrace did not show anything meaningful.

      Summary
      ============

      • The query shows as completed on the Impala debug page, with a few rows already fetched (as desired).
      • The Impala logs show that the query was cancelled (as desired).
      • An impala_shell.py process that appears to be running the query still shows up in 'ps -ef'.
      • According to 'top', no process spikes in CPU usage.

      Assumption
      ============
      As the debug page shows that the query is completed, I assume that both the 'waiting for state' step and the actual cancellation of the query finished successfully, so execution most likely hangs at step 4), where the results are retrieved from ImpalaShell.

      1) p = ImpalaShell(args)
      2) self.wait_for_query_state(stmt, cancel_at_state)
      3) os.kill(p.pid(), signal.SIGINT)
      4) result = p.get_result()
      

      get_result() contains a shell_process.communicate() call that reads stdout and stderr from the underlying process. According to the Python documentation for communicate(), it is not suitable when the amount of data is large.
      Since this query fetches and prints results for more than 30 minutes, the stdout of ImpalaShell can be considered large.
      https://docs.python.org/2/library/subprocess.html
      "Note: The data read is buffered in memory, so do not use this method if the data size is large or unlimited."

      If this is indeed the root cause, a possible solution is to modify util.py:ImpalaShell so that an input parameter decides, when calling Popen, whether stdout is connected through a pipe or not connected at all. This would suit this test because stdout is not used at all; only stderr is asserted on, so there is no need to retrieve the stdout data from ImpalaShell.
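
A minimal sketch of the proposed change, assuming Python 3 and a hypothetical helper named run_shell with a hypothetical capture_stdout parameter (neither is the real util.py:ImpalaShell API). As a further assumption, stdout is sent to subprocess.DEVNULL when it is not captured, which is one way of "not connecting" it; leaving stdout unset so that the child inherits the parent's stdout would be another.

```python
import subprocess
import sys


def run_shell(cmd, capture_stdout=True):
    """Run cmd and return (returncode, stdout, stderr).

    stderr is always captured through a pipe. stdout is captured only when
    capture_stdout is True; otherwise it is discarded, so communicate()
    never has to buffer a potentially huge result set in memory.
    """
    stdout_target = subprocess.PIPE if capture_stdout else subprocess.DEVNULL
    proc = subprocess.Popen(cmd, stdout=stdout_target, stderr=subprocess.PIPE,
                            universal_newlines=True)
    # communicate() returns None for any stream that was not piped.
    out, err = proc.communicate()
    return proc.returncode, out, err


if __name__ == "__main__":
    # Stand-in child process: large stdout, short message on stderr.
    child = [sys.executable, "-c",
             "import sys; print('x' * 100000); sys.stderr.write('done')"]
    rc, out, err = run_shell(child, capture_stdout=False)
    print(rc, out, err)
```

A test that only asserts on stderr could then call the helper with capture_stdout=False and ignore the stdout value.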

          People

            gaborkaszab Gabor Kaszab