Event-processor shouldn't go into ERROR state for failures in fetching events

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Impala 4.0.0, Impala 3.4.0, Impala 3.4.1, Impala 4.1.0, Impala 4.2.0, Impala 4.1.1, Impala 4.1.2, Impala 4.3.0
    • Fix Version/s: Impala 4.4.0
    • Component/s: Catalog
    • Labels: None

    Description

      Since IMPALA-8240, the event-processor is allowed to retry on MetastoreNotificationFetchExceptions. However, there are still several places where HMS failures in fetching events are not converted into MetastoreNotificationFetchExceptions:

      1. getNextMetastoreEvents() throws IllegalStateException if it fails to create a MetaStoreClient.

      E1024 05:00:58.458434   258 MetastoreEventsProcessor.java:888] Unexpected exception received while processing event
      Java exception follows:
      java.lang.IllegalStateException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
              at org.apache.impala.catalog.MetaStoreClientPool$MetaStoreClient.<init>(MetaStoreClientPool.java:105)
              at org.apache.impala.catalog.MetaStoreClientPool$MetaStoreClient.<init>(MetaStoreClientPool.java:78)
              at org.apache.impala.catalog.MetaStoreClientPool.getClient(MetaStoreClientPool.java:205)
              at org.apache.impala.catalog.Catalog.getMetaStoreClient(Catalog.java:397)
              at org.apache.impala.catalog.events.MetastoreEventsProcessor.getNextMetastoreEvents(MetastoreEventsProcessor.java:802)
              at org.apache.impala.catalog.events.MetastoreEventsProcessor.getNextMetastoreEvents(MetastoreEventsProcessor.java:848)
              at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:869)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:750)
      Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
              at org.apache.hadoop.hive.metastore.utils.JavaUtils.newInstance(JavaUtils.java:86)
              at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:98)
              at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:151)
              at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:122)
              at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:115)
              at org.apache.impala.catalog.MetaStoreClientPool$MetaStoreClient.<init>(MetaStoreClientPool.java:99)
              ... 13 more
      Caused by: java.lang.reflect.InvocationTargetException
              at sun.reflect.GeneratedConstructorAccessor948.newInstance(Unknown Source)
              at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
              at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
              at org.apache.hadoop.hive.metastore.utils.JavaUtils.newInstance(JavaUtils.java:84)
              ... 18 more
      Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: Peer indicated failure: Failure to initialize security context
              at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:171)
              at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:244)
              at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:39)
              at org.apache.hadoop.hive.metastore.security.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:51)
              at org.apache.hadoop.hive.metastore.security.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:48)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
              at org.apache.hadoop.hive.metastore.security.TUGIAssumingTransport.open(TUGIAssumingTransport.java:48)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:758)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:271)
              at sun.reflect.GeneratedConstructorAccessor948.newInstance(Unknown Source)
              at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
              at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
              at org.apache.hadoop.hive.metastore.utils.JavaUtils.newInstance(JavaUtils.java:84)
              at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:98)
              at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:151)
              at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:122)
              at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:115)
              at org.apache.impala.catalog.MetaStoreClientPool$MetaStoreClient.<init>(MetaStoreClientPool.java:99)
              at org.apache.impala.catalog.MetaStoreClientPool$MetaStoreClient.<init>(MetaStoreClientPool.java:78)
              at org.apache.impala.catalog.MetaStoreClientPool.getClient(MetaStoreClientPool.java:205)
              at org.apache.impala.catalog.Catalog.getMetaStoreClient(Catalog.java:397)
              at org.apache.impala.catalog.events.MetastoreEventsProcessor.getNextMetastoreEvents(MetastoreEventsProcessor.java:802)
              at org.apache.impala.catalog.events.MetastoreEventsProcessor.getNextMetastoreEvents(MetastoreEventsProcessor.java:848)
              at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:869)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:750)
      )
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:829)
              at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:271)
              ... 22 more
      

      2. processEvents() doesn't handle the failures of getCurrentEventId() as MetastoreNotificationFetchExceptions. Instead, getCurrentEventId() throws CatalogException:

      E1114 16:01:11.121475 28921 MetastoreEventsProcessor.java:942] Unexpected exception received while processing event
      Java exception follows:
      org.apache.impala.catalog.CatalogException: Unable to fetch the current notification event id. Check if metastore service is accessible
              at org.apache.impala.catalog.events.MetastoreEventsProcessor.getCurrentEventId(MetastoreEventsProcessor.java:744)
              at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:922)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:750)
      E1114 16:01:11.136973 28921 MetastoreEventsProcessor.java:1190] Notification event is null 
      W1114 16:01:11.137122 28921 MetastoreEventsProcessor.java:913] Event processing is skipped since status is ERROR. Last synced event id is 8406252
      

      The event-processor should recognize these HMS errors as retriable and not go into the ERROR state, so that it can keep retrying until the connection to HMS is back to normal.
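
The requested behavior can be sketched as a small state machine. This is a minimal illustration and not Impala's MetastoreEventsProcessor: EventPollerSketch, EventSource and EventFetchException are hypothetical names, and treating every other runtime exception as fatal is an assumption made to keep the example short.

```java
/** Hypothetical retriable failure raised while fetching events from the metastore. */
class EventFetchException extends RuntimeException {
  EventFetchException(String message) { super(message); }
}

/** Sketch of one polling round that separates fetch failures from other failures. */
class EventPollerSketch {
  enum Status { ACTIVE, ERROR }

  /** Hypothetical source of the latest event id, e.g. a metastore RPC. */
  interface EventSource {
    long fetchLatestEventId();
  }

  private Status status = Status.ACTIVE;
  private long lastSyncedEventId = -1;

  Status getStatus() { return status; }
  long getLastSyncedEventId() { return lastSyncedEventId; }

  /** Runs one polling round; a scheduler would call this periodically. */
  void pollOnce(EventSource source) {
    if (status != Status.ACTIVE) return;  // rounds are skipped once in ERROR
    try {
      lastSyncedEventId = source.fetchLatestEventId();
    } catch (EventFetchException e) {
      // Retriable: keep the ACTIVE status so the next round tries again.
    } catch (RuntimeException e) {
      // Anything else is treated as unrecoverable in this sketch.
      status = Status.ERROR;
    }
  }

  public static void main(String[] args) {
    EventPollerSketch poller = new EventPollerSketch();
    poller.pollOnce(() -> { throw new EventFetchException("HMS unreachable"); });
    System.out.println(poller.getStatus());
    poller.pollOnce(() -> 8406253L);
    System.out.println(poller.getLastSyncedEventId());
    poller.pollOnce(() -> { throw new IllegalStateException("unexpected"); });
    System.out.println(poller.getStatus());
  }
}
```

The order of the catch clauses matters: the retriable type must be caught before the general RuntimeException handler, otherwise every fetch failure would still end in ERROR.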

          Activity

            ASF subversion and git services (jira-bot) added a comment:

            Commit 5af8fef199b60fb7725971b419596a36e48b1eec in impala's branch refs/heads/master from stiga-huang
            [ https://gitbox.apache.org/repos/asf?p=impala.git;h=5af8fef19 ]

            IMPALA-12561: Event-processor shouldn't go into ERROR state for failures in fetching events

            Any failure in fetching HMS events should be retriable. The
            event-processor should not go into the ERROR state, which can only
            be recovered from by a global INVALIDATE METADATA command.

            This patch deals with the failure in creating a new MetaStoreClient
            by throwing a MetastoreClientInstantiationException instead of an
            IllegalStateException. Previously the IllegalStateException could fail
            the process of fetching HMS events. Now callers can catch the
            MetastoreClientInstantiationException and convert it into a
            MetastoreNotificationFetchException if the process is retriable, so
            the event-processor can retry in the next round. There are still other
            callers of Catalog#getMetaStoreClient() that don't catch the new
            exception since their work can't be easily retried.

            Also makes sure MetastoreEventsProcessor.getCurrentEventId() only throws
            MetastoreNotificationFetchException. Previously it threw
            CatalogException, which would fail the event-processor. Note that
            CatalogException is used for errors in accessing objects in the Catalog,
            e.g. table not found. We shouldn't throw it when fetching HMS events
            fails.
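
The exception translation described in the two paragraphs above might look roughly like the following. This is a sketch under stated assumptions, not the patch itself: the two exception classes only borrow their names from the commit message and are redefined here as plain unchecked exceptions, while FakeClient, createClient and the Supplier-based signatures are hypothetical.

```java
import java.util.function.Supplier;

/** Stand-in for the exception thrown when a metastore client cannot be created. */
class MetastoreClientInstantiationException extends RuntimeException {
  MetastoreClientInstantiationException(String msg, Throwable cause) { super(msg, cause); }
}

/** Stand-in for the retriable exception the event-processor knows how to handle. */
class MetastoreNotificationFetchException extends RuntimeException {
  MetastoreNotificationFetchException(String msg, Throwable cause) { super(msg, cause); }
}

/** Hypothetical client exposing only the call needed by this sketch. */
interface FakeClient {
  long currentEventId();
}

class FetchExceptionTranslationSketch {
  /** Creates a client and reports any failure with the dedicated exception type. */
  static FakeClient createClient(Supplier<FakeClient> constructor) {
    try {
      return constructor.get();
    } catch (RuntimeException e) {
      throw new MetastoreClientInstantiationException("Failed to create metastore client", e);
    }
  }

  /**
   * Fetches the current event id. Every failure, including a failed client
   * creation, surfaces as a MetastoreNotificationFetchException.
   */
  static long getCurrentEventId(Supplier<FakeClient> constructor) {
    try {
      FakeClient client = createClient(constructor);
      return client.currentEventId();
    } catch (RuntimeException e) {
      // Covers MetastoreClientInstantiationException as well as RPC failures.
      throw new MetastoreNotificationFetchException(
          "Unable to fetch the current notification event id", e);
    }
  }

  public static void main(String[] args) {
    try {
      getCurrentEventId(() -> { throw new RuntimeException("Unable to instantiate client"); });
    } catch (MetastoreNotificationFetchException e) {
      System.out.println("retriable, caused by " + e.getCause().getClass().getSimpleName());
    }
  }
}
```

Callers whose work cannot be retried would call createClient directly and let MetastoreClientInstantiationException propagate, which mirrors the remark about the other callers of Catalog#getMetaStoreClient().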

            Tests:

            • Add FE unit test to verify MetastoreNotificationFetchException is
              thrown as expected. To mimic HMS connection failures, use a
              customized MetaStoreClientPool that uses a wrong HMS port.
            • Add e2e test in custom_cluster/test_catalog_hms_failures.py. The test
              class previously only ran in exhaustive jobs due to its long running
              time. Optimize the test to only restart HMS. Add a new option,
              -if_not_running, for run-hive-server.sh to avoid unnecessary
              restarts.
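
The "wrong port" idea from the first test bullet can be illustrated without any Hive or Impala classes. The sketch below is an assumption-laden stand-in for the real FE test: it uses plain sockets instead of a customized client pool, and FetchFailedException, WrongPortSketch, findClosedPort and fetchEvents are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical retriable exception used only in this sketch. */
class FetchFailedException extends RuntimeException {
  FetchFailedException(String msg, Throwable cause) { super(msg, cause); }
}

class WrongPortSketch {
  /** Returns a loopback port that had a listener a moment ago and now has none. */
  static int findClosedPort() {
    try (ServerSocket probe = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
      return probe.getLocalPort();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Pretends to fetch events by connecting to the given loopback port. */
  static void fetchEvents(int port) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(InetAddress.getLoopbackAddress(), port), 2000);
    } catch (IOException e) {
      // The connection failure is reported as the retriable exception type.
      throw new FetchFailedException("Cannot reach metastore on port " + port, e);
    }
  }

  public static void main(String[] args) {
    int port = findClosedPort();
    try {
      fetchEvents(port);
      System.out.println("unexpectedly connected");
    } catch (FetchFailedException e) {
      System.out.println("got retriable failure: " + e.getMessage());
    }
  }
}
```

Pointing the client at a port with no listener makes the failure fast and repeatable, which avoids having to stop a real service for a unit test.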

            Change-Id: I775684d473fdbfb9f0531234f59a6239bd0873e3
            Reviewed-on: http://gerrit.cloudera.org:8080/20707
            Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
            Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>


            People

              Assignee: Quanlong Huang (stigahuang)
              Reporter: Quanlong Huang (stigahuang)
              Votes: 0
              Watchers: 3