Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-11768

EventProcessor should ignore failures on non-existing dbs/tables

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • None
    • Impala 4.3.0
    • Catalog
    • None

    Description

      We recently saw test failures in https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/6697/

      metadata.test_event_processing.TestEventProcessing.test_event_based_replication
      metadata.test_event_processing.TestEventProcessing.test_empty_partition_events
      metadata.test_event_processing.TestEventProcessing.test_transactional_insert_events
      metadata.test_event_processing.TestEventProcessing.test_insert_events
      metadata.test_event_processing.TestEventProcessing.test_empty_partition_events_transactional
      

      The errors are similar:

      metadata/test_event_processing.py:179: in test_event_based_replication
          self.__run_event_based_replication_tests()
      metadata/test_event_processing.py:194: in __run_event_based_replication_tests
          EventProcessorUtils.wait_for_event_processing(self)
      util/event_processor_utils.py:65: in wait_for_event_processing
          within {1} seconds".format(current_event_id, timeout))
      E   Exception: Event processor did not sync till last known event id 21676         within 10 seconds
      

      Stderr:

      SET client_identifier=metadata/test_event_processing.py::TestEventProcessing::()::test_event_based_replication;
      -- connecting to: localhost:21000
      -- connecting to localhost:21050 with impyla
      -- 2022-11-29 14:10:14,332 INFO     MainThread: Closing active operation
      -- connecting to localhost:28000 with impyla
      -- 2022-11-29 14:10:14,346 INFO     MainThread: Closing active operation
      -- 2022-11-29 14:10:19,364 INFO     MainThread: Waiting until events processor syncs to event id:21676
      

      Looking into the catalogd logs, the event-processor is already in the NEEDS_INVALIDATE state so it's paused:

      W1129 14:10:09.401335    92 MetastoreEventsProcessor.java:844] Event processing is skipped since status is NEEDS_INVALIDATE. Last synced event id is 21538
      

      The last event it processed is an INSERT event. The metadata reloading failed:

      I1129 14:10:07.365427    92 MetastoreEventsProcessor.java:805] Received 36 events. Start event id : 21538
      I1129 14:10:07.367424    92 MetastoreEvents.java:631] EventId: 21552 EventType: DROP_DATABASE Creating event 21552 of type DROP_DATABASE on database test_views_ddl_b1c33615
      I1129 14:10:07.371448    92 MetastoreEvents.java:631] EventId: 21573 EventType: DROP_DATABASE Creating event 21573 of type DROP_DATABASE on database test_compute_stats_9c95c5d8
      I1129 14:10:07.371696    92 MetastoreEvents.java:281] Total number of events received: 36 Total number of events filtered out: 0
      I1129 14:10:07.371767    92 MetastoreEvents.java:285] Incremented skipped metric to 2631
      I1129 14:10:07.371951    92 HdfsTable.java:2744] Reloading partition metadata: test_alter_table_hdfs_caching_95073aa2.i1670e_alter x=1 (INSERT event)
      W1129 14:10:07.381459    92 RetryingMetaStoreClient.java:269] MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. getPartitionsByNames
      Java exception follows:
      org.apache.thrift.TApplicationException: Internal error processing get_partitions_by_names_req
      ...
      

      It then fails in the retry, and needs manual invalidate metadata:

      I1129 14:10:08.381718    92 RetryingMetaStoreClient.java:185] RetryingMetaStoreClient trying reconnect as ubuntu (auth:SIMPLE)
      ...
      I1129 14:10:08.395550    92 HiveMetaStoreClient.java:821] Connected to metastore.
      I1129 14:10:08.400802    92 MetastoreEventsProcessor.java:1026] Time elapsed in processing event batch: 1s035ms
      E1129 14:10:08.401016    92 MetastoreEventsProcessor.java:859] Event processing needs a invalidate command to resolve the state
      Java exception follows:
      org.apache.impala.catalog.events.MetastoreNotificationNeedsInvalidateException: EventId: 21539 EventType: INSERT Refresh partition on table {} partition {} failed. Event processing cannot continue. Issue an invalidate metadata command to reset the event processor state.
              at org.apache.impala.catalog.events.MetastoreEvents$InsertEvent.processPartitionInserts(MetastoreEvents.java:1298)
              at org.apache.impala.catalog.events.MetastoreEvents$InsertEvent.process(MetastoreEvents.java:1277)
              at org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:514)
              at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1013)
              at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:851)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.impala.catalog.CatalogException: Unexpected error while retrieving partitions for table test_alter_table_hdfs_caching_95073aa2.i1670e_alter
              at org.apache.impala.catalog.HdfsTable.reloadPartitionsFromNames(HdfsTable.java:2768)
              at org.apache.impala.service.CatalogOpExecutor.reloadPartitionsIfExist(CatalogOpExecutor.java:4501)
              at org.apache.impala.catalog.events.MetastoreEvents$MetastoreTableEvent.reloadPartitions(MetastoreEvents.java:879)
              at org.apache.impala.catalog.events.MetastoreEvents$InsertEvent.processPartitionInserts(MetastoreEvents.java:1295)
              ... 11 more
      Caused by: org.apache.thrift.TApplicationException: Internal error processing get_partitions_by_names_req
              at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
              at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partitions_by_names_req(ThriftHiveMetastore.java:4162)
              at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_by_names_req(ThriftHiveMetastore.java:4149)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNamesInternal(HiveMetaStoreClient.java:2500)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNames(HiveMetaStoreClient.java:2491)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNames(HiveMetaStoreClient.java:2468)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNames(HiveMetaStoreClient.java:2445)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNames(HiveMetaStoreClient.java:2409)
              at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:498)
              at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:212)
              at com.sun.proxy.$Proxy11.getPartitionsByNames(Unknown Source)
              at org.apache.impala.catalog.HdfsTable.reloadPartitionsFromNames(HdfsTable.java:2749)
              ... 14 more
      

      In HMS logs, the internal error is that the table is dropped:

      2022-11-29T14:10:08,398 ERROR [TThreadPoolServer WorkerProcess-28] metastore.RetryingHMSHandler: InvalidObjectException(message:hive.test_alter_table_hdfs_caching_95073aa2.i1670e_alter table not found)
      ...
      2022-11-29T14:10:08,398 ERROR [TThreadPoolServer WorkerProcess-28] thrift.ProcessFunction: Internal error processing get_partitions_by_names_req
      org.apache.hadoop.hive.metastore.api.InvalidObjectException: hive.test_alter_table_hdfs_caching_95073aa2.i1670e_alter table not found
      ...
      

      It's dropped by a concurrent DROP DATABASE CASCADE command. In the catalogd logs:

      I1129 14:10:07.005285   104 JniCatalog.java:257] execDdl request: DROP_DATABASE test_alter_table_hdfs_caching_95073aa2 issued by ubuntu
      ...
      I1129 14:10:07.387348   104 CatalogServiceCatalog.java:3290] Should wait for next update for DATABASE:test_alter_table_hdfs_caching_95073aa2: older version 4077 is sent. Expects a version >= 4727.
      

      When failed to process the event, the event-processor should doulbe check whether the target (db/table) still exists in HMS in case of such concurrent drop operations. If the target no longer exists, just skip the failure and invalidate the corresponding catalog object.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            hemanth619 Sai Hemanth Gantasala
            stigahuang Quanlong Huang
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment