[IMPALA-12561] Event-processor shouldn't go into ERROR state for failures in fetching events - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: Impala 4.0.0, Impala 3.4.0, Impala 3.4.1, Impala 4.1.0, Impala 4.2.0, Impala 4.1.1, Impala 4.1.2, Impala 4.3.0
Fix Version/s: Impala 4.4.0
Component/s: Catalog
Labels:
None

Epic Link:
event-processor-completeness

Description

Since IMPALA-8240, the event-processor retries when it hits a MetastoreNotificationFetchException. However, there are still several places where HMS failures in fetching events are not converted into MetastoreNotificationFetchException:

1. getNextMetastoreEvents() throws IllegalStateException if it fails to create a MetaStoreClient.

E1024 05:00:58.458434   258 MetastoreEventsProcessor.java:888] Unexpected exception received while processing event
Java exception follows:
java.lang.IllegalStateException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
        at org.apache.impala.catalog.MetaStoreClientPool$MetaStoreClient.<init>(MetaStoreClientPool.java:105)
        at org.apache.impala.catalog.MetaStoreClientPool$MetaStoreClient.<init>(MetaStoreClientPool.java:78)
        at org.apache.impala.catalog.MetaStoreClientPool.getClient(MetaStoreClientPool.java:205)
        at org.apache.impala.catalog.Catalog.getMetaStoreClient(Catalog.java:397)
        at org.apache.impala.catalog.events.MetastoreEventsProcessor.getNextMetastoreEvents(MetastoreEventsProcessor.java:802)
        at org.apache.impala.catalog.events.MetastoreEventsProcessor.getNextMetastoreEvents(MetastoreEventsProcessor.java:848)
        at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:869)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
        at org.apache.hadoop.hive.metastore.utils.JavaUtils.newInstance(JavaUtils.java:86)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:98)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:151)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:122)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:115)
        at org.apache.impala.catalog.MetaStoreClientPool$MetaStoreClient.<init>(MetaStoreClientPool.java:99)
        ... 13 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedConstructorAccessor948.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hive.metastore.utils.JavaUtils.newInstance(JavaUtils.java:84)
        ... 18 more
Caused by: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: Peer indicated failure: Failure to initialize security context
        at org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:171)
        at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:244)
        at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:39)
        at org.apache.hadoop.hive.metastore.security.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:51)
        at org.apache.hadoop.hive.metastore.security.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:48)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
        at org.apache.hadoop.hive.metastore.security.TUGIAssumingTransport.open(TUGIAssumingTransport.java:48)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:758)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:271)
        at sun.reflect.GeneratedConstructorAccessor948.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hive.metastore.utils.JavaUtils.newInstance(JavaUtils.java:84)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:98)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:151)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:122)
        at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:115)
        at org.apache.impala.catalog.MetaStoreClientPool$MetaStoreClient.<init>(MetaStoreClientPool.java:99)
        at org.apache.impala.catalog.MetaStoreClientPool$MetaStoreClient.<init>(MetaStoreClientPool.java:78)
        at org.apache.impala.catalog.MetaStoreClientPool.getClient(MetaStoreClientPool.java:205)
        at org.apache.impala.catalog.Catalog.getMetaStoreClient(Catalog.java:397)
        at org.apache.impala.catalog.events.MetastoreEventsProcessor.getNextMetastoreEvents(MetastoreEventsProcessor.java:802)
        at org.apache.impala.catalog.events.MetastoreEventsProcessor.getNextMetastoreEvents(MetastoreEventsProcessor.java:848)
        at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:869)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:829)
        at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:271)
        ... 22 more

2. processEvents() doesn't handle the failures of getCurrentEventId() as MetastoreNotificationFetchExceptions. Instead, getCurrentEventId() throws CatalogException:

E1114 16:01:11.121475 28921 MetastoreEventsProcessor.java:942] Unexpected exception received while processing event
Java exception follows:
org.apache.impala.catalog.CatalogException: Unable to fetch the current notification event id. Check if metastore service is accessible
        at org.apache.impala.catalog.events.MetastoreEventsProcessor.getCurrentEventId(MetastoreEventsProcessor.java:744)
        at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:922)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
E1114 16:01:11.136973 28921 MetastoreEventsProcessor.java:1190] Notification event is null 
W1114 16:01:11.137122 28921 MetastoreEventsProcessor.java:913] Event processing is skipped since status is ERROR. Last synced event id is 8406252

The event-processor should recognize these HMS errors as fetch failures and not go into the ERROR state, so that it can keep retrying until the connection to HMS is back to normal.
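
To illustrate the intended behavior, here is a minimal sketch of a processing round that treats fetch failures as retriable and only enters ERROR for other failures. It is not Impala's code: `EventProcessorSketch`, `FetchException`, `EventSource`, and `Status` are hypothetical stand-ins (with `FetchException` playing the role of MetastoreNotificationFetchException), and the real MetastoreEventsProcessor has more states and bookkeeping.

```java
import java.util.List;

// Hypothetical, simplified model of an event-processor round. Not Impala code.
class EventProcessorSketch {
  /** Stand-in for MetastoreNotificationFetchException: a retriable fetch failure. */
  static class FetchException extends RuntimeException {
    FetchException(String msg, Throwable cause) { super(msg, cause); }
  }

  enum Status { ACTIVE, ERROR }

  /** Hypothetical event source; fetchNext() throws when HMS is unreachable. */
  interface EventSource {
    List<String> fetchNext();
  }

  private final EventSource source;
  private Status status = Status.ACTIVE;
  private int processed = 0;

  EventProcessorSketch(EventSource source) { this.source = source; }

  Status getStatus() { return status; }
  int getProcessed() { return processed; }

  /** One scheduled round: fetch events, then process them. */
  void processEvents() {
    if (status == Status.ERROR) return;  // skipped until manually recovered
    try {
      List<String> events = source.fetchNext();
      processed += events.size();
    } catch (FetchException e) {
      // Fetching failed (e.g. HMS is down): keep the current status so the
      // next scheduled round simply tries again.
    } catch (RuntimeException e) {
      // Any other failure is treated as non-retriable in this sketch.
      status = Status.ERROR;
    }
  }
}
```

With this shape, the bug described above amounts to HMS connection failures surfacing as exceptions other than the fetch exception, so they fall into the second catch block and stop the processor.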

Attachments

Issue Links

relates to

IMPALA-8240 Event processor should keep trying if metastore is unavailable

Resolved

Activity

ASF subversion and git services added a comment - 24/Dec/23 00:04

Commit 5af8fef199b60fb7725971b419596a36e48b1eec in impala's branch refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5af8fef19 ]

IMPALA-12561: Event-processor shouldn't go into ERROR state for failures in fetching events

Any failure in fetching HMS events should be retriable. The
event-processor should not go into the ERROR state, from which it can
only be recovered by a global INVALIDATE METADATA command.

This patch deals with the failure in creating a new MetaStoreClient
by throwing a MetastoreClientInstantiationException instead of an
IllegalStateException. Previously the IllegalStateException could fail
the process of fetching HMS events. Now callers can catch the
MetastoreClientInstantiationException and convert it into
MetastoreNotificationFetchException if the process is retriable. So the
event-processor can retry in the next round. There are still other
callers of Catalog#getMetaStoreClient() that don't catch the new
exception since their work can't be easily retried.
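
A rough sketch of the catch-and-convert pattern described in this paragraph follows. The classes are hypothetical stand-ins rather than Impala's actual types: `ClientInstantiationException` plays the role of MetastoreClientInstantiationException, `NotificationFetchException` the role of MetastoreNotificationFetchException, and `ClientPool`/`Client` are simplified placeholders for the client pool and its clients.

```java
import java.util.List;

// Hypothetical stand-ins illustrating the catch-and-convert pattern. Not Impala code.
class ClientInstantiationSketch {
  /** Thrown by the pool when a client cannot be created. */
  static class ClientInstantiationException extends RuntimeException {
    ClientInstantiationException(String msg, Throwable cause) { super(msg, cause); }
  }

  /** Retriable failure in fetching notification events. */
  static class NotificationFetchException extends RuntimeException {
    NotificationFetchException(String msg, Throwable cause) { super(msg, cause); }
  }

  interface Client extends AutoCloseable {
    List<Long> nextEventIds(long lastSyncedId);
    @Override void close();  // narrowed: no checked exception
  }

  interface ClientPool {
    /** May throw ClientInstantiationException. */
    Client getClient();
  }

  /** Pool that always fails, mimicking an unreachable HMS. */
  static class BrokenPool implements ClientPool {
    @Override public Client getClient() {
      RuntimeException cause =
          new RuntimeException("Unable to instantiate HiveMetaStoreClient");
      // A dedicated exception type instead of IllegalStateException lets
      // callers decide whether the failure is retriable for them.
      throw new ClientInstantiationException("Failed to create client", cause);
    }
  }

  /** A retriable caller: converts the instantiation failure into a fetch failure. */
  static List<Long> fetchNextEvents(ClientPool pool, long lastSyncedId) {
    try (Client client = pool.getClient()) {
      return client.nextEventIds(lastSyncedId);
    } catch (ClientInstantiationException e) {
      throw new NotificationFetchException(
          "Unable to fetch notifications after event id " + lastSyncedId, e);
    }
  }
}
```

Callers whose work can't be easily retried would simply not catch `ClientInstantiationException` and let it propagate.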

Also makes sure MetastoreEventsProcessor.getCurrentEventId() only throws
MetastoreNotificationFetchException. Previously it threw
CatalogException, which would fail the event-processor. Note that
CatalogException is used for errors in accessing objects in the Catalog,
e.g. table not found. We shouldn't throw it when fetching HMS events
fails.
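
The idea can be sketched as wrapping whatever the HMS call throws into the fetch exception, so callers see a single retriable type. This is an assumption-laden illustration, not the actual patch: `CurrentEventIdSketch` and `NotificationFetchException` are hypothetical names, and the HMS call is modeled as a plain `Callable<Long>`.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: getCurrentEventId() surfaces only a fetch exception.
class CurrentEventIdSketch {
  /** Stand-in for MetastoreNotificationFetchException. */
  static class NotificationFetchException extends RuntimeException {
    NotificationFetchException(String msg, Throwable cause) { super(msg, cause); }
  }

  /**
   * Returns the current notification event id. Any failure of the underlying
   * HMS call is reported as NotificationFetchException rather than as a
   * catalog-level exception, since nothing in the catalog is at fault.
   */
  static long getCurrentEventId(Callable<Long> hmsCall) {
    try {
      return hmsCall.call();
    } catch (Exception e) {
      throw new NotificationFetchException(
          "Unable to fetch the current notification event id. "
              + "Check if metastore service is accessible", e);
    }
  }
}
```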

Tests:

Add FE unit test to verify MetastoreNotificationFetchException is
thrown as expected. To mimic HMS connection failures, use a
customized MetastoreClientPool that uses a wrong HMS port.
Add e2e test in custom_cluster/test_catalog_hms_failures.py. The test
class previously ran only in exhaustive jobs due to its long running
time. Optimize the test to only restart HMS. Adds a new option,
-if_not_running, for run-hive-server.sh to avoid unnecessary
restarts.

Change-Id: I775684d473fdbfb9f0531234f59a6239bd0873e3
Reviewed-on: http://gerrit.cloudera.org:8080/20707
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

