Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
None
-
None
-
ghx-label-9
Description
Saw TestCatalogRPCTimeout.test_catalog_rpc_retries failed in an internal job. The commit id is 624b6c646819d7a15d550169ceca4ce5a3fa794f. The failure is
custom_cluster/test_rpc_timeout.py:275: in test_catalog_rpc_retries self.execute_query("refresh functional.alltypes") common/impala_test_suite.py:878: in wrapper return function(*args, **kwargs) common/impala_test_suite.py:910: in execute_query return self.__execute_query(self.client, query, query_options) common/impala_test_suite.py:1003: in __execute_query return impalad_client.execute(query, user=user) common/impala_connection.py:214: in execute return self.__beeswax_client.execute(sql_stmt, user=user) beeswax/impala_beeswax.py:191: in execute handle = self.__execute_query(query_string.strip(), user=user) beeswax/impala_beeswax.py:369: in __execute_query self.wait_for_finished(handle) beeswax/impala_beeswax.py:390: in wait_for_finished raise ImpalaBeeswaxException("Query aborted:" + error_log, None) E ImpalaBeeswaxException: ImpalaBeeswaxException: E Query aborted:NullPointerException: null
catalogd logs:
I0105 08:19:22.862653 28878 debug-util.cc:439] Debug Action: RESET_METADATA_DELAY:JITTER@10000@0.75 sleeping for 4791 ms I0105 08:19:27.671103 28878 JniUtil.java:166] 3d452b4ae20cdf17:93549bd700000000] resetMetadata request: REFRESH TABLE functional.alltypes issued by jenkins I0105 08:19:27.679502 28902 TableLoader.java:76] Loading metadata for: functional.alltypes (Load triggered by REFRESH TABLE functional.alltypes issued by jenkins) I0105 08:19:27.679899 28878 TableLoadingMgr.java:72] 3d452b4ae20cdf17:93549bd700000000] Loading metadata for table: functional.alltypes I0105 08:19:27.679981 28878 TableLoadingMgr.java:74] 3d452b4ae20cdf17:93549bd700000000] Remaining items in queue: 0. Loads in progress: 1 I0105 08:19:27.709326 28902 Table.java:234] createEventId_ for table: functional.alltypes set to: -1 I0105 08:19:27.709725 28902 HdfsTable.java:1251] Loading metadata for table definition and all partition(s) of functional.alltypes (Load triggered by REFRESH TABLE functional.alltypes issued by jenkins) I0105 08:19:27.771229 28902 HdfsTable.java:1886] Loaded 13 columns from HMS. Actual columns: 13 I0105 08:19:27.805235 28902 HdfsTable.java:3056] Load Valid Write Id List Done. Time taken: 26.465us I0105 08:19:27.805305 28902 HdfsTable.java:1291] Fetching partition metadata from the Metastore: functional.alltypes I0105 08:19:27.809267 28902 MetaStoreUtil.java:190] Fetching 24 partitions for: functional.alltypes using partition batch size: 1000 I0105 08:19:27.821784 28902 MetaStoreUtil.java:208] Fetched 24/24 partitions for table functional.alltypes I0105 08:19:27.821811 28902 HdfsTable.java:1298] Fetched partition metadata from the Metastore: functional.alltypes I0105 08:19:27.862032 28776 TAcceptQueueServer.cpp:355] New connection to server CatalogService from client <Host: 127.0.0.1 Port: 39038> I0105 08:19:27.870393 28903 JniUtil.java:166] 3d452b4ae20cdf17:93549bd700000000] resetMetadata request: REFRESH TABLE functional.alltypes issued by jenkins I0105 08:19:27.870957 28903 TableLoadingMgr.java:72] 3d452b4ae20cdf17:93549bd700000000] Loading metadata for table: functional.alltypes I0105 08:19:27.871035 28903 TableLoadingMgr.java:74] 3d452b4ae20cdf17:93549bd700000000] Remaining items in queue: 0. Loads in progress: 1 I0105 08:19:27.907255 28902 ParallelFileMetadataLoader.java:224] Loading file and block metadata for 24 paths for table functional.alltypes using a thread pool of size 5 I0105 08:19:27.957284 28902 HdfsTable.java:821] Loaded file and block metadata for functional.alltypes partitions: year=2009/month=1, year=2009/month=10, year=2009/month=11, and 21 others. Time taken: 53.850ms I0105 08:19:27.959899 28902 Table.java:1081] last refreshed event id for table: functional.alltypes set to: -1 I0105 08:19:27.959956 28902 TableLoader.java:176] Loaded metadata for: functional.alltypes (280ms) E0105 08:19:27.980352 28903 JniUtil.java:183] 3d452b4ae20cdf17:93549bd700000000] Error in REFRESH TABLE functional.alltypes issued by jenkins. Time spent: 110ms I0105 08:19:27.980885 28903 jni-util.cc:302] 3d452b4ae20cdf17:93549bd700000000] java.lang.NullPointerException at org.apache.impala.catalog.monitor.CatalogOperationTracker.archiveRecord(CatalogOperationTracker.java:104) at org.apache.impala.catalog.monitor.CatalogOperationTracker.decrement(CatalogOperationTracker.java:160) at org.apache.impala.service.CatalogOpExecutor.execResetMetadata(CatalogOpExecutor.java:6614) at org.apache.impala.service.JniCatalog.lambda$resetMetadata$4(JniCatalog.java:321) at org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) at org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) at org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100) at org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:237) at org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:251) at org.apache.impala.service.JniCatalog.resetMetadata(JniCatalog.java:320) I0105 08:19:27.980897 28903 status.cc:129] 3d452b4ae20cdf17:93549bd700000000] NullPointerException: null
The REFRESH request from the same query id was sent twice to catalogd, causing conflicts in the inFlightOperations map.
100 private void archiveRecord(TUniqueId queryId, String errorMsg) { 101 if (queryId != null && inFlightOperations.containsKey(queryId)) { 102 TCatalogOpRecord record = inFlightOperations.remove(queryId); 103 if (catalogOperationLogSize == 0) return; 104 record.setFinish_time_ms(System.currentTimeMillis()); // record could be removed by another thread so become null 105 if (errorMsg != null) { 106 record.setStatus("FAILED"); 107 record.setDetails(record.getDetails() + ", error=" + errorMsg); 108 } else { 109 record.setStatus("FINISHED"); 110 } 111 synchronized (finishedOperations) { 112 if (finishedOperations.size() >= catalogOperationLogSize) { 113 finishedOperations.poll(); 114 } 115 finishedOperations.add(record); 116 } 117 } 118 }
We should use queryId+threadId as the key in case of catalog RPC retries.
It's not a severe issue in production cluster since catalog_client_rpc_timeout_ms is 0 by default. So catalog RPC never timeout and got retried.
Attachments
Issue Links
- causes
-
IMPALA-12716 Failing TestWebPage.test_catalog_operations_with_rpc_retry
- Resolved
- is caused by
-
IMPALA-9118 Add debug page for in-flight DDLs in catalogd
- Resolved