Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-11768

EventProcessor should ignore failures on non-existing dbs/tables

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • None
    • Impala 4.3.0
    • Catalog
    • None

    Description

      We recently saw test failures in https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/6697/

      metadata.test_event_processing.TestEventProcessing.test_event_based_replication
      metadata.test_event_processing.TestEventProcessing.test_empty_partition_events
      metadata.test_event_processing.TestEventProcessing.test_transactional_insert_events
      metadata.test_event_processing.TestEventProcessing.test_insert_events
      metadata.test_event_processing.TestEventProcessing.test_empty_partition_events_transactional
      

      The errors are similar:

      metadata/test_event_processing.py:179: in test_event_based_replication
          self.__run_event_based_replication_tests()
      metadata/test_event_processing.py:194: in __run_event_based_replication_tests
          EventProcessorUtils.wait_for_event_processing(self)
      util/event_processor_utils.py:65: in wait_for_event_processing
          within {1} seconds".format(current_event_id, timeout))
      E   Exception: Event processor did not sync till last known event id 21676         within 10 seconds
      

      Stderr:

      SET client_identifier=metadata/test_event_processing.py::TestEventProcessing::()::test_event_based_replication;
      -- connecting to: localhost:21000
      -- connecting to localhost:21050 with impyla
      -- 2022-11-29 14:10:14,332 INFO     MainThread: Closing active operation
      -- connecting to localhost:28000 with impyla
      -- 2022-11-29 14:10:14,346 INFO     MainThread: Closing active operation
      -- 2022-11-29 14:10:19,364 INFO     MainThread: Waiting until events processor syncs to event id:21676
      

      Looking into the catalogd logs, the event-processor is already in the NEEDS_INVALIDATE state so it's paused:

      W1129 14:10:09.401335    92 MetastoreEventsProcessor.java:844] Event processing is skipped since status is NEEDS_INVALIDATE. Last synced event id is 21538
      

      The last event it processed is an INSERT event. The metadata reloading failed:

      I1129 14:10:07.365427    92 MetastoreEventsProcessor.java:805] Received 36 events. Start event id : 21538
      I1129 14:10:07.367424    92 MetastoreEvents.java:631] EventId: 21552 EventType: DROP_DATABASE Creating event 21552 of type DROP_DATABASE on database test_views_ddl_b1c33615
      I1129 14:10:07.371448    92 MetastoreEvents.java:631] EventId: 21573 EventType: DROP_DATABASE Creating event 21573 of type DROP_DATABASE on database test_compute_stats_9c95c5d8
      I1129 14:10:07.371696    92 MetastoreEvents.java:281] Total number of events received: 36 Total number of events filtered out: 0
      I1129 14:10:07.371767    92 MetastoreEvents.java:285] Incremented skipped metric to 2631
      I1129 14:10:07.371951    92 HdfsTable.java:2744] Reloading partition metadata: test_alter_table_hdfs_caching_95073aa2.i1670e_alter x=1 (INSERT event)
      W1129 14:10:07.381459    92 RetryingMetaStoreClient.java:269] MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. getPartitionsByNames
      Java exception follows:
      org.apache.thrift.TApplicationException: Internal error processing get_partitions_by_names_req
      ...
      

      It then fails in the retry, and needs manual invalidate metadata:

      I1129 14:10:08.381718    92 RetryingMetaStoreClient.java:185] RetryingMetaStoreClient trying reconnect as ubuntu (auth:SIMPLE)
      ...
      I1129 14:10:08.395550    92 HiveMetaStoreClient.java:821] Connected to metastore.
      I1129 14:10:08.400802    92 MetastoreEventsProcessor.java:1026] Time elapsed in processing event batch: 1s035ms
      E1129 14:10:08.401016    92 MetastoreEventsProcessor.java:859] Event processing needs a invalidate command to resolve the state
      Java exception follows:
      org.apache.impala.catalog.events.MetastoreNotificationNeedsInvalidateException: EventId: 21539 EventType: INSERT Refresh partition on table {} partition {} failed. Event processing cannot continue. Issue an invalidate metadata command to reset the event processor state.
              at org.apache.impala.catalog.events.MetastoreEvents$InsertEvent.processPartitionInserts(MetastoreEvents.java:1298)
              at org.apache.impala.catalog.events.MetastoreEvents$InsertEvent.process(MetastoreEvents.java:1277)
              at org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:514)
              at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1013)
              at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:851)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: org.apache.impala.catalog.CatalogException: Unexpected error while retrieving partitions for table test_alter_table_hdfs_caching_95073aa2.i1670e_alter
              at org.apache.impala.catalog.HdfsTable.reloadPartitionsFromNames(HdfsTable.java:2768)
              at org.apache.impala.service.CatalogOpExecutor.reloadPartitionsIfExist(CatalogOpExecutor.java:4501)
              at org.apache.impala.catalog.events.MetastoreEvents$MetastoreTableEvent.reloadPartitions(MetastoreEvents.java:879)
              at org.apache.impala.catalog.events.MetastoreEvents$InsertEvent.processPartitionInserts(MetastoreEvents.java:1295)
              ... 11 more
      Caused by: org.apache.thrift.TApplicationException: Internal error processing get_partitions_by_names_req
              at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
              at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_partitions_by_names_req(ThriftHiveMetastore.java:4162)
              at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_partitions_by_names_req(ThriftHiveMetastore.java:4149)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNamesInternal(HiveMetaStoreClient.java:2500)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNames(HiveMetaStoreClient.java:2491)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNames(HiveMetaStoreClient.java:2468)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNames(HiveMetaStoreClient.java:2445)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getPartitionsByNames(HiveMetaStoreClient.java:2409)
              at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:498)
              at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:212)
              at com.sun.proxy.$Proxy11.getPartitionsByNames(Unknown Source)
              at org.apache.impala.catalog.HdfsTable.reloadPartitionsFromNames(HdfsTable.java:2749)
              ... 14 more
      

      In HMS logs, the internal error is that the table is dropped:

      2022-11-29T14:10:08,398 ERROR [TThreadPoolServer WorkerProcess-28] metastore.RetryingHMSHandler: InvalidObjectException(message:hive.test_alter_table_hdfs_caching_95073aa2.i1670e_alter table not found)
      ...
      2022-11-29T14:10:08,398 ERROR [TThreadPoolServer WorkerProcess-28] thrift.ProcessFunction: Internal error processing get_partitions_by_names_req
      org.apache.hadoop.hive.metastore.api.InvalidObjectException: hive.test_alter_table_hdfs_caching_95073aa2.i1670e_alter table not found
      ...
      

      It's dropped by a concurrent DROP DATABASE CASCADE command. In the catalogd logs:

      I1129 14:10:07.005285   104 JniCatalog.java:257] execDdl request: DROP_DATABASE test_alter_table_hdfs_caching_95073aa2 issued by ubuntu
      ...
      I1129 14:10:07.387348   104 CatalogServiceCatalog.java:3290] Should wait for next update for DATABASE:test_alter_table_hdfs_caching_95073aa2: older version 4077 is sent. Expects a version >= 4727.
      

      When failed to process the event, the event-processor should doulbe check whether the target (db/table) still exists in HMS in case of such concurrent drop operations. If the target no longer exists, just skip the failure and invalidate the corresponding catalog object.

      Attachments

        1. impalad.b99660bbd523.invalid-user.log.INFO.20221129-132601.1.bz2
          37.89 MB
          Quanlong Huang
        2. hive-server2.log.gz
          6.51 MB
          Quanlong Huang
        3. hive-metastore.log.gz
          5.51 MB
          Quanlong Huang
        4. catalogd.84223c9b5251.invalid-user.log.INFO.20221129-132600.1.gz
          1.81 MB
          Quanlong Huang

        Issue Links

          Activity

            People

              hemanth619 Sai Hemanth Gantasala
              stigahuang Quanlong Huang
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: