Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-10338

TestAdmissionController.test_queue_reasons_slots flaky because of slow/hanging fragment

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • Impala 4.0.0
    • None
    • Backend
    • ghx-label-11

    Description

      This is on an s3 debug build, commit 5a00a4c06f8ec40a8867dcbc036cf5bb47b8a3be

      custom_cluster.test_admission_controller.TestAdmissionController.test_queue_reasons_slots (from pytest)
      Failing for the past 1 build (Since Failed#673 )
      Took 1 min 58 sec.
      add description
      Error Message
      
      Timeout: query 504b3a2511f3cd0e:e27bec6b00000000 did not reach one of the expected states [4], last known state 3
      
      Stacktrace
      
      custom_cluster/test_admission_controller.py:967: in test_queue_reasons_slots
          TIMEOUT_S, config_options={"mt_dop": 4})
      custom_cluster/test_admission_controller.py:277: in _execute_and_collect_profiles
          state = self.wait_for_any_state(handle, expected_states, timeout_s)
      common/impala_test_suite.py:1081: in wait_for_any_state
          actual_state))
      E   Timeout: query 504b3a2511f3cd0e:e27bec6b00000000 did not reach one of the expected states [4], last known state 3
      

      Those numbers are beeswax QUeryStates:

      enum QueryState {
        CREATED = 0
        INITIALIZED = 1
        COMPILED = 2
        RUNNING = 3
        FINISHED = 4
        EXCEPTION = 5
      }
      

      I.e. it appears to have run for > 60 seconds.

      The timing of that query:

      I1116 13:25:08.449323 32665 impala-server.cc:1242] 504b3a2511f3cd0e:e27bec6b00000000] Registered query query_id=504b3a2511f3cd0e:e27bec6b00000000 session_id=874c2c9eaf2ad730:0004c1bb0ba7b4a7
      I1116 13:25:08.449626 32665 Frontend.java:1532] 504b3a2511f3cd0e:e27bec6b00000000] Analyzing query: select min(ss_wholesale_cost) from tpcds_parquet.store_sales db: default
      ...
      I1116 13:25:08.567667   367 admission-controller.cc:1532] 504b3a2511f3cd0e:e27bec6b00000000] Scheduling query 504b3a2511f3cd0e:e27bec6b00000000 with membership version 2
      I1116 13:25:08.567767   367 admission-controller.cc:1590] 504b3a2511f3cd0e:e27bec6b00000000] Scheduling for executor group: default-pool-group1 with 3 executors
      I1116 13:25:08.643026   367 admission-controller.cc:1640] 504b3a2511f3cd0e:e27bec6b00000000] Trying to admit query to pool default-pool in executor group default-pool-group1 (3 executors)
      ...
      I1116 13:25:49.184185 32432 admission-controller.cc:1811] Admitting from queue: query=504b3a2511f3cd0e:e27bec6b00000000
      I1116 13:25:49.184196 32432 admission-controller.cc:1903] For Query 504b3a2511f3cd0e:e27bec6b00000000 per_backend_mem_limit set to: -1.00 B per_backend_m
      em_to_admit set to: 114.02 MB coord_backend_mem_limit set to: -1.00 B coord_backend_mem_to_admit set to: 114.02 MB
      I1116 13:25:49.184350   367 admission-controller.cc:1288] 504b3a2511f3cd0e:e27bec6b00000000] Admitted queued query id=504b3a2511f3cd0e:e27bec6b00000000
      I1116 13:25:49.184370   367 admission-controller.cc:1289] 504b3a2511f3cd0e:e27bec6b00000000] Final: agg_num_running=1, agg_num_queued=0, agg_mem_reserved
      =17.25 KB,  local_host(local_mem_admitted=342.05 MB, num_admitted_running=1, num_queued=0, backend_mem_reserved=17.25 KB, topN_query_stats: queries=[0a42
      32658ea48bc5:c015646900000000, 5b470ebea7782154:ea22bdb000000000, 554d88d8f812e22d:efbaf75200000000, aa4a301189a4d144:fdc17ff700000000], total_mem_consum
      ed=17.25 KB, fraction_of_pool_total_mem=1; pool_level_stats: num_running=4, min=0, max=17.25 KB, pool_total_mem=17.25 KB, average_per_query=4.31 KB)
      I1116 13:25:49.185214   367 impala-server.cc:2062] 504b3a2511f3cd0e:e27bec6b00000000] Registering query locations
      I1116 13:25:49.185261   367 coordinator.cc:149] 504b3a2511f3cd0e:e27bec6b00000000] Exec() query_id=504b3a2511f3cd0e:e27bec6b00000000 stmt=select min(ss_w
      holesale_cost) from tpcds_parquet.store_sales
      I1116 13:25:49.186172   367 coordinator.cc:473] 504b3a2511f3cd0e:e27bec6b00000000] starting execution on 3 backends for query_id=504b3a2511f3cd0e:e27bec6
      b00000000
      I1116 13:25:49.189028 32071 control-service.cc:142] 504b3a2511f3cd0e:e27bec6b00000000] ExecQueryFInstances(): query_id=504b3a2511f3cd0e:e27bec6b00000000 
      coord=impala-ec2-centos74-m5-4xlarge-ondemand-018e.vpc.cloudera.com:27000 #instances=5
      I1116 13:25:49.192133   367 coordinator.cc:532] 504b3a2511f3cd0e:e27bec6b00000000] started execution on 3 backends for query_id=504b3a2511f3cd0e:e27bec6b
      00000000
      ...
      I1116 13:26:49.278456 32665 impala-server.cc:1320] UnregisterQuery(): query_id=504b3a2511f3cd0e:e27bec6b00000000
      I1116 13:26:49.278478 32665 coordinator.cc:672] ExecState: query id=504b3a2511f3cd0e:e27bec6b00000000 execution cancelled
      I1116 13:26:49.278493 32665 coordinator-backend-state.cc:921] query_id=504b3a2511f3cd0e:e27bec6b00000000 target backend=127.0.0.1:27000: Sending CancelQueryFInstances rpc
      ...
      
      

      As far as when the instances finished - 13:25:55

      tarmstrong@tarmstrong-box:~/tmplogs8/impala-asf-master-core-s3-DEBUG-673/logs/custom_cluster_tests/relevant$ grep -h -R 504b3a2511f3cd0e.*'Instance' *
      I1116 13:25:49.189028 32071 control-service.cc:142] 504b3a2511f3cd0e:e27bec6b00000000] ExecQueryFInstances(): query_id=504b3a2511f3cd0e:e27bec6b00000000 coord=impala-ec2-centos74-m5-4xlarge-ondemand-018e.vpc.cloudera.com:27000 #instances=5
      I1116 13:25:55.125044   640 query-state.cc:906] 504b3a2511f3cd0e:e27bec6b00000002] Instance completed. instance_id=504b3a2511f3cd0e:e27bec6b00000002 #in-flight=4 status=OK
      I1116 13:25:55.136926   642 query-state.cc:906] 504b3a2511f3cd0e:e27bec6b00000004] Instance completed. instance_id=504b3a2511f3cd0e:e27bec6b00000004 #in-flight=3 status=OK
      I1116 13:25:55.139536   639 query-state.cc:906] 504b3a2511f3cd0e:e27bec6b00000001] Instance completed. instance_id=504b3a2511f3cd0e:e27bec6b00000001 #in-flight=2 status=OK
      I1116 13:25:55.144346   641 query-state.cc:906] 504b3a2511f3cd0e:e27bec6b00000003] Instance completed. instance_id=504b3a2511f3cd0e:e27bec6b00000003 #in-flight=1 status=OK
      I1116 13:26:49.278493 32665 coordinator-backend-state.cc:921] query_id=504b3a2511f3cd0e:e27bec6b00000000 target backend=127.0.0.1:27000: Sending CancelQueryFInstances rpc
      I1116 13:26:49.278884 32665 coordinator-backend-state.cc:921] query_id=504b3a2511f3cd0e:e27bec6b00000000 target backend=127.0.0.1:27002: Sending CancelQueryFInstances rpc
      I1116 13:26:49.278980   643 query-state.cc:906] 504b3a2511f3cd0e:e27bec6b00000000] Instance completed. instance_id=504b3a2511f3cd0e:e27bec6b00000000 #in-flight=0 status=CANCELLED: Cancelled
      I1116 13:25:49.190251 32005 control-service.cc:142] 504b3a2511f3cd0e:e27bec6b00000000] ExecQueryFInstances(): query_id=504b3a2511f3cd0e:e27bec6b00000000 coord=impala-ec2-centos74-m5-4xlarge-ondemand-018e.vpc.cloudera.com:27000 #instances=4
      I1116 13:25:55.015192   645 query-state.cc:906] 504b3a2511f3cd0e:e27bec6b00000006] Instance completed. instance_id=504b3a2511f3cd0e:e27bec6b00000006 #in-flight=3 status=OK
      I1116 13:25:55.015666   646 query-state.cc:906] 504b3a2511f3cd0e:e27bec6b00000007] Instance completed. instance_id=504b3a2511f3cd0e:e27bec6b00000007 #in-flight=2 status=OK
      I1116 13:25:55.026870   644 query-state.cc:906] 504b3a2511f3cd0e:e27bec6b00000005] Instance completed. instance_id=504b3a2511f3cd0e:e27bec6b00000005 #in-flight=1 status=OK
      I1116 13:25:55.034817   647 query-state.cc:906] 504b3a2511f3cd0e:e27bec6b00000008] Instance completed. instance_id=504b3a2511f3cd0e:e27bec6b00000008 #in-flight=0 status=OK
      I1116 13:25:49.191587 31720 control-service.cc:142] 504b3a2511f3cd0e:e27bec6b00000000] ExecQueryFInstances(): query_id=504b3a2511f3cd0e:e27bec6b00000000 coord=impala-ec2-centos74-m5-4xlarge-ondemand-018e.vpc.cloudera.com:27000 #instances=4
      I1116 13:25:55.641626   648 query-state.cc:906] 504b3a2511f3cd0e:e27bec6b00000009] Instance completed. instance_id=504b3a2511f3cd0e:e27bec6b00000009 #in-flight=3 status=OK
      I1116 13:25:55.646078   649 query-state.cc:906] 504b3a2511f3cd0e:e27bec6b0000000a] Instance completed. instance_id=504b3a2511f3cd0e:e27bec6b0000000a #in-flight=2 status=OK
      I1116 13:25:55.650281   652 query-state.cc:906] 504b3a2511f3cd0e:e27bec6b0000000b] Instance completed. instance_id=504b3a2511f3cd0e:e27bec6b0000000b #in-flight=1 status=OK
      

      Attachments

        1. failure-output.txt
          31 kB
          Tim Armstrong
        2. impalad.ERROR
          42 kB
          Tim Armstrong
        3. impalad.INFO
          340 kB
          Tim Armstrong

        Issue Links

          Activity

            People

              asherman Andrew Sherman
              tarmstrong Tim Armstrong
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated: