Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-12997

test_query_log tests get stuck trying to write to the log

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • Impala 4.4.0
    • Impala 4.5.0
    • Catalog

    Description

      In some test runs, most tests under test_query_log will start to fail on various conditions like

      custom_cluster/test_query_log.py:452: in test_query_log_table_query_select_mt_dop
          "impala-server.completed-queries.written", 1, 60)
      common/impala_service.py:144: in wait_for_metric_value
          self.__metric_timeout_assert(metric_name, expected_value, timeout)
      common/impala_service.py:213: in __metric_timeout_assert
          assert 0, assert_string
      E   AssertionError: Metric impala-server.completed-queries.written did not reach value 1 in 60s.
      E   Dumping debug webpages in JSON format...
      E   Dumped memz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/memz.json
      E   Dumped metrics JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/metrics.json
      E   Dumped queries JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/queries.json
      E   Dumped sessions JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/sessions.json
      E   Dumped threadz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/threadz.json
      E   Dumped rpcz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/rpcz.json
      E   Dumping minidumps for impalads/catalogds...
      E   Dumped minidump for Impalad PID 3680802
      E   Dumped minidump for Impalad PID 3680805
      E   Dumped minidump for Impalad PID 3680809
      E   Dumped minidump for Catalogd PID 3680732
      

      or

      custom_cluster/test_query_log.py:921: in test_query_log_ignored_sqls
          assert len(sql_results.data) == 1, "query not found in completed queries table"
      E   AssertionError: query not found in completed queries table
      E   assert 0 == 1
      E    +  where 0 = len([])
      E    +    where [] = <tests.common.impala_connection.ImpylaHS2ResultSet object at 0xffffa00cc350>.data
      

      One symptom that seems related to this is INSERT operations into sys.impala_query_log that start "UnregisterQuery()" but never finish (with "Query successfully unregistered").

      We can identify cases like that with

      for log in $(ag -l 'INSERT INTO sys.impala_query_log' impalad.*); do echo $log; for qid in $(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO sys.impala_query_log' $log | cut -d']' -f1); do if ! ag "Query successfully unregistered: query_id=$qid" $log; then echo "$qid not unregistered"; fi; done; done
      

      A similar case may occur with creating the table too

      for log in $(ag -l 'CREATE TABLE IF NOT EXISTS sys.impala_query_log' impalad.impala-ec2-rhel88-m7g-4xlarge-ondemand-0a5a.vpc.cloudera.com.jenkins.log.INFO.20240410-*); do QID=$(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO sys.impala_query_log' $log | cut -d']' -f1); echo $log; ag "Query successfully unregistered: query_id=$QID" $log; done
      

      although these frequently fail because the test completes and shuts down Impala before the CREATE TABLE query completes.

      Tracking one of those cases led to catalogd errors that repeated for 1m27s before the test suite restarted catalogd:

      W0410 12:48:05.051760 3681790 Tasks.java:456] 6647229faf7637d5:3ec7565b00000000] Retrying task after failure: Waiting for lock on table sys.impala_query_log
      Java exception follows:
      org.apache.iceberg.hive.MetastoreLock$WaitingForLockException: Waiting for lock on table sys.impala_query_log
      	at org.apache.iceberg.hive.MetastoreLock.lambda$acquireLock$1(MetastoreLock.java:217)
      	at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
      	at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
      	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
      	at org.apache.iceberg.hive.MetastoreLock.acquireLock(MetastoreLock.java:209)
      	at org.apache.iceberg.hive.MetastoreLock.lock(MetastoreLock.java:146)
      	at org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:194)
      	at org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:135)
      	at org.apache.iceberg.BaseTransaction.lambda$commitSimpleTransaction$3(BaseTransaction.java:417)
      	at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
      	at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
      	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
      	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
      	at org.apache.iceberg.BaseTransaction.commitSimpleTransaction(BaseTransaction.java:413)
      	at org.apache.iceberg.BaseTransaction.commitTransaction(BaseTransaction.java:308)
      	at org.apache.impala.service.CatalogOpExecutor.updateCatalogImpl(CatalogOpExecutor.java:7220)
      	at org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:6939)
      	at org.apache.impala.service.JniCatalog.lambda$updateCatalog$15(JniCatalog.java:504)
      	at org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
      	at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
      	at org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
      	at org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100)
      	at org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:245)
      	at org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:259)
      	at org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:503)
      

      This seems to correspond to the first test failure and persisted through all tests afterwards. So this lock was likely acquired from HMS, then never released.

      Attachments

        Issue Links

          Activity

            People

              MikaelSmith Michael Smith
              MikaelSmith Michael Smith
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: