Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
Impala 4.4.0
-
ghx-label-9
Description
In some test runs, most tests under test_query_log will start to fail on various conditions like
custom_cluster/test_query_log.py:452: in test_query_log_table_query_select_mt_dop "impala-server.completed-queries.written", 1, 60) common/impala_service.py:144: in wait_for_metric_value self.__metric_timeout_assert(metric_name, expected_value, timeout) common/impala_service.py:213: in __metric_timeout_assert assert 0, assert_string E AssertionError: Metric impala-server.completed-queries.written did not reach value 1 in 60s. E Dumping debug webpages in JSON format... E Dumped memz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/memz.json E Dumped metrics JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/metrics.json E Dumped queries JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/queries.json E Dumped sessions JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/sessions.json E Dumped threadz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/threadz.json E Dumped rpcz JSON to $IMPALA_HOME/logs/metric_timeout_diags_20240410_12:49:04/json/rpcz.json E Dumping minidumps for impalads/catalogds... E Dumped minidump for Impalad PID 3680802 E Dumped minidump for Impalad PID 3680805 E Dumped minidump for Impalad PID 3680809 E Dumped minidump for Catalogd PID 3680732
or
custom_cluster/test_query_log.py:921: in test_query_log_ignored_sqls assert len(sql_results.data) == 1, "query not found in completed queries table" E AssertionError: query not found in completed queries table E assert 0 == 1 E + where 0 = len([]) E + where [] = <tests.common.impala_connection.ImpylaHS2ResultSet object at 0xffffa00cc350>.data
One symptom that seems related to this is INSERT operations into sys.impala_query_log that start "UnregisterQuery()" but never finish (with "Query successfully unregistered").
We can identify cases like that with
for log in $(ag -l 'INSERT INTO sys.impala_query_log' impalad.*); do echo $log; for qid in $(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO sys.impala_query_log' $log | cut -d']' -f1); do if ! ag "Query successfully unregistered: query_id=$qid" $log; then echo "$qid not unregistered"; fi; done; done
A similar case may occur with creating the table too
for log in $(ag -l 'CREATE TABLE IF NOT EXISTS sys.impala_query_log' impalad.impala-ec2-rhel88-m7g-4xlarge-ondemand-0a5a.vpc.cloudera.com.jenkins.log.INFO.20240410-*); do QID=$(ag -o '[0-9a-f]*:[0-9a-f]*\] Analyzing query: INSERT INTO sys.impala_query_log' $log | cut -d']' -f1); echo $log; ag "Query successfully unregistered: query_id=$QID" $log; done
although these frequently fail because the test completes and shuts down Impala before the CREATE TABLE query completes.
Tracking one of those cases led to catalogd errors that repeated for 1m27s before the test suite restarted catalogd:
W0410 12:48:05.051760 3681790 Tasks.java:456] 6647229faf7637d5:3ec7565b00000000] Retrying task after failure: Waiting for lock on table sys.impala_query_log Java exception follows: org.apache.iceberg.hive.MetastoreLock$WaitingForLockException: Waiting for lock on table sys.impala_query_log at org.apache.iceberg.hive.MetastoreLock.lambda$acquireLock$1(MetastoreLock.java:217) at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413) at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219) at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203) at org.apache.iceberg.hive.MetastoreLock.acquireLock(MetastoreLock.java:209) at org.apache.iceberg.hive.MetastoreLock.lock(MetastoreLock.java:146) at org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:194) at org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:135) at org.apache.iceberg.BaseTransaction.lambda$commitSimpleTransaction$3(BaseTransaction.java:417) at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413) at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219) at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203) at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196) at org.apache.iceberg.BaseTransaction.commitSimpleTransaction(BaseTransaction.java:413) at org.apache.iceberg.BaseTransaction.commitTransaction(BaseTransaction.java:308) at org.apache.impala.service.CatalogOpExecutor.updateCatalogImpl(CatalogOpExecutor.java:7220) at org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:6939) at org.apache.impala.service.JniCatalog.lambda$updateCatalog$15(JniCatalog.java:504) at org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) at org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) at org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100) at org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:245) at org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:259) at org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:503)
This seems to correspond to the first test failure and persisted through all tests afterwards. So this lock was likely acquired from HMS, then never released.
Attachments
Issue Links
- is related to
-
IMPALA-12768 Set iceberg.engine.hive.lock-enabled To false By Default
- Open