Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
None
-
None
-
ghx-label-10
Description
We recently saw test failures in https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/6697/
metadata.test_event_processing.TestEventProcessing.test_event_based_replication metadata.test_event_processing.TestEventProcessing.test_empty_partition_events metadata.test_event_processing.TestEventProcessing.test_transactional_insert_events metadata.test_event_processing.TestEventProcessing.test_insert_events metadata.test_event_processing.TestEventProcessing.test_empty_partition_events_transactional
The errors are similar:
metadata/test_event_processing.py:179: in test_event_based_replication self.__run_event_based_replication_tests() metadata/test_event_processing.py:194: in __run_event_based_replication_tests EventProcessorUtils.wait_for_event_processing(self) util/event_processor_utils.py:65: in wait_for_event_processing within {1} seconds".format(current_event_id, timeout)) E Exception: Event processor did not sync till last known event id 21676 within 10 seconds
Stderr:
SET client_identifier=metadata/test_event_processing.py::TestEventProcessing::()::test_event_based_replication; -- connecting to: localhost:21000 -- connecting to localhost:21050 with impyla -- 2022-11-29 14:10:14,332 INFO MainThread: Closing active operation -- connecting to localhost:28000 with impyla -- 2022-11-29 14:10:14,346 INFO MainThread: Closing active operation -- 2022-11-29 14:10:19,364 INFO MainThread: Waiting until events processor syncs to event id:21676
Looking into the catalogd logs, the event-processor is already in the NEEDS_INVALIDATE state so it's paused:
W1129 14:10:09.401335 92 MetastoreEventsProcessor.java:844] Event processing is skipped since status is NEEDS_INVALIDATE. Last synced event id is 21538
The last event it processed is an INSERT event. The metadata reloading failed:
I1129 14:10:07.365427 92 MetastoreEventsProcessor.java:805] Received 36 events. Start event id : 21538 I1129 14:10:07.367424 92 MetastoreEvents.java:631] EventId: 21552 EventType: DROP_DATABASE Creating event 21552 of type DROP_DATABASE on database test_views_ddl_b1c33615 I1129 14:10:07.371448 92 MetastoreEvents.java:631] EventId: 21573 EventType: DROP_DATABASE Creating event 21573 of type DROP_DATABASE on database test_compute_stats_9c95c5d8 I1129 14:10:07.371696 92 MetastoreEvents.java:281] Total number of events received: 36 Total number of events filtered out: 0 I1129 14:10:07.371767 92 MetastoreEvents.java:285] Incremented skipped metric to 2631 I1129 14:10:07.371951 92 HdfsTable.java:2744] Reloading partition metadata: test_alter_table_hdfs_caching_95073aa2.i1670e_alter x=1 (INSERT event) W1129 14:10:07.381459 92 RetryingMetaStoreClient.java:269] MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. getPartitionsByNames Java exception follows: org.apache.thrift.TApplicationException: Internal error processing get_partitions_by_names_req ...
It then fails in the retry, and needs manual invalidate metadata:
I1129 14:10:08.381718 92 RetryingMetaStoreClient.java:185] RetryingMetaStoreClient trying reconnect as ubuntu (auth:SIMPLE) ... I1129 14:10:08.395550 92 HiveMetaStoreClient.java:821] Connected to metastore. I1129 14:10:08.400802 92 MetastoreEventsProcessor.java:1026] Time elapsed in processing event batch: 1s035ms E1129 14:10:08.401016 92 MetastoreEventsProcessor.java:859] Event processing needs a invalidate command to resolve the state Java exception follows: org.apache.impala.catalog.events.MetastoreNotificationNeedsInvalidateException: EventId: 21539 EventType: INSERT Refresh partition on table {} partition {} failed. Event processing cannot continue. Issue an invalidate metadata command to reset the event processor state. at org.apache.impala.catalog.events.MetastoreEvents$InsertEvent.processPartitionInserts(MetastoreEvents.java:1298) at org.apache.impala.catalog.events.MetastoreEvents$InsertEvent.process(MetastoreEvents.java:1277) at org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:514) at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1013) at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:851) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.impala.catalog.CatalogException: Unexpected error while retrieving partitions for table test_alter_table_hdfs_caching_95073aa2.i1670e_alter at org.apache.impala.catalog.HdfsTable.reloadPartitionsFromNames(HdfsTable.java:2768) at org.apache.impala.service.CatalogOpExecutor.reloadPartitionsIfExist(CatalogOpExecutor.java:4501) at org.apache.impala.catalog.events.MetastoreEvents$MetastoreTableEvent.reloadPartitions(MetastoreEvents.java:879) at org.apache.impala.catalog.events.MetastoreEvents$InsertEvent.processPartitionInserts(MetastoreEvents.java:1295) ... 11 more Caused by: org.apache.thrift.TApplicationException: Internal error processing get_partitions_by_names_req at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partitions_by_names_req(ThriftHiveMetastore.java:4162) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_by_names_req(ThriftHiveMetastore.java:4149) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNamesInternal(HiveMetaStoreClient.java:2500) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNames(HiveMetaStoreClient.java:2491) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNames(HiveMetaStoreClient.java:2468) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNames(HiveMetaStoreClient.java:2445) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNames(HiveMetaStoreClient.java:2409) at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:212) at com.sun.proxy.$Proxy11.getPartitionsByNames(Unknown Source) at org.apache.impala.catalog.HdfsTable.reloadPartitionsFromNames(HdfsTable.java:2749) ... 14 more
In HMS logs, the internal error is that the table is dropped:
2022-11-29T14:10:08,398 ERROR [TThreadPoolServer WorkerProcess-28] metastore.RetryingHMSHandler: InvalidObjectException(message:hive.test_alter_table_hdfs_caching_95073aa2.i1670e_alter table not found) ... 2022-11-29T14:10:08,398 ERROR [TThreadPoolServer WorkerProcess-28] thrift.ProcessFunction: Internal error processing get_partitions_by_names_req org.apache.hadoop.hive.metastore.api.InvalidObjectException: hive.test_alter_table_hdfs_caching_95073aa2.i1670e_alter table not found ...
It's dropped by a concurrent DROP DATABASE CASCADE command. In the catalogd logs:
I1129 14:10:07.005285 104 JniCatalog.java:257] execDdl request: DROP_DATABASE test_alter_table_hdfs_caching_95073aa2 issued by ubuntu ... I1129 14:10:07.387348 104 CatalogServiceCatalog.java:3290] Should wait for next update for DATABASE:test_alter_table_hdfs_caching_95073aa2: older version 4077 is sent. Expects a version >= 4727.
When failed to process the event, the event-processor should doulbe check whether the target (db/table) still exists in HMS in case of such concurrent drop operations. If the target no longer exists, just skip the failure and invalidate the corresponding catalog object.
Attachments
Attachments
Issue Links
- is related to
-
IMPALA-11968 test_event_based_replication fails with timeout
- Open
-
IMPALA-11816 Flaky TestEventProcessing tests
- Open