IMPALA-6662

Make stress test resilient to hangs due to client crashes

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Impala 3.2.0
    • Component/s: Infrastructure
    • Labels: None

    Description

      The concurrent_select.py process starts multiple subprocesses (called query runners) to run the queries. It also starts two threads: the query producer thread and the query consumer thread. The query producer thread adds queries to a query queue, and the query consumer thread pulls queries off the queue and feeds them to the query runners.

      The query runner, once it gets queries, does the following:

      (Pseudocode; the real code is here: https://github.com/apache/impala/blob/d49f629c447ea59ad73ceeb0547fde4d41c651d1/tests/stress/concurrent_select.py#L583-L595)
      
      with _submit_query_lock:
          increment(num_queries_started)
      run_query()    # One runner crashes here.
      increment(num_queries_finished)
      
      

      One of the runners crashes inside run_query() and therefore never increments num_queries_finished.
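One way to keep the two counters balanced when run_query() fails with a Python exception is to increment the finished counter in a finally block. The following is only a minimal sketch under stated assumptions: QueryCounters and run_one_query are hypothetical names, not the ones used in concurrent_select.py, and this is not necessarily the fix that was committed. It also does not help if the runner process dies outright (for example, if it is killed), because no finally block runs in that case.

```python
import threading


class QueryCounters:
    """Hypothetical stand-in for the shared started/finished counters."""

    def __init__(self):
        self._lock = threading.Lock()
        self.started = 0
        self.finished = 0

    def mark_started(self):
        with self._lock:
            self.started += 1

    def mark_finished(self):
        with self._lock:
            self.finished += 1

    def running(self):
        # Number of queries that have started but not yet finished.
        with self._lock:
            return self.started - self.finished


def run_one_query(counters, submit_lock, run_query):
    """Run one query and always balance the started/finished counters."""
    with submit_lock:
        counters.mark_started()
    try:
        return run_query()
    finally:
        # Executes even if run_query() raises, so running() can return to 0.
        counters.mark_finished()
```

With this shape, an exception raised by run_query() still propagates to the caller, but the running-query count is no longer left permanently above zero.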

      Another thread, which is supposed to check for memory leaks (but actually doesn't), periodically acquires '_submit_query_lock' and waits for the number of running queries to reach 0 before releasing the lock:
      https://github.com/apache/impala/blob/d49f629c447ea59ad73ceeb0547fde4d41c651d1/tests/stress/concurrent_select.py#L449-L511

      However, in the above case, the number of running queries will never reach 0, because the crashed query runner exited without incrementing 'num_queries_finished'. The poll_mem_usage() function therefore holds the lock indefinitely, so no new queries are submitted and the stress test never completes.
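The hang described above comes from an unbounded wait while the lock is held. A hedged sketch of a bounded wait follows; wait_for_idle, num_running and runners_alive are hypothetical names introduced for illustration and do not appear in concurrent_select.py. The idea is that the waiter gives up when a runner is known to be dead or when a deadline passes, so the caller can release the lock and report the failure instead of blocking forever.

```python
import time


def wait_for_idle(num_running, runners_alive, timeout_s=60.0, poll_s=0.1):
    """Wait until no queries are running; return False instead of hanging.

    num_running:   callable returning the current number of in-flight queries.
    runners_alive: callable returning True while every runner is still alive
                   (assumption: e.g. built on multiprocessing.Process.is_alive()).
    """
    deadline = time.monotonic() + timeout_s
    while num_running() > 0:
        if not runners_alive():
            # A runner died, so its query may never be counted as finished.
            return False
        if time.monotonic() >= deadline:
            # Bounded wait: stop rather than hold the lock indefinitely.
            return False
        time.sleep(poll_s)
    return True
```

A caller would treat a False return as a test failure: release the lock, log which condition was hit, and stop the run.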

            People

              Assignee: Tim Armstrong (tarmstrong)
              Reporter: Sailesh Mukil (sailesh)