
IMPALA-11812: Catalogd OOM due to lots of HMS FieldSchema instances

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version: Impala 4.3.0
    • Component: Catalog

    Description

      For partitioned wide tables with thousands of columns, catalogd might hit OOM in routines that operate on them, e.g. when running AlterTableRecoverPartitions for all their partitions, or when initially loading all of their partitions.

      The direct reason is that the heap is full of HMS FieldSchema instances. Here is a histogram of the issue in a 4GB heap:

      Class Name                                                   |     Objects |  Shallow Heap |
      --------------------------------------------------------------------------------------------
      org.apache.hadoop.hive.metastore.api.FieldSchema             | 111,876,486 | 2,685,035,664 |
      java.lang.Object[]                                           |      78,026 |   449,929,656 |
      char[]                                                       |      91,295 |     6,241,744 |
      java.util.ArrayList                                          |     171,126 |     4,107,024 |
      java.util.HashMap                                            |      71,135 |     3,414,480 |
      java.lang.String                                             |      91,161 |     2,187,864 |
      java.util.concurrent.ConcurrentHashMap$Node                  |      59,614 |     1,907,648 |
      java.util.concurrent.atomic.LongAdder                        |      53,021 |     1,696,672 |
      org.apache.hadoop.hive.metastore.api.Partition               |      22,374 |     1,610,928 |
      com.codahale.metrics.EWMA                                    |      30,780 |     1,477,440 |
      com.codahale.metrics.LongAdderProxy$JdkProvider$1            |      53,021 |     1,272,504 |
      org.apache.hadoop.hive.metastore.api.StorageDescriptor       |      22,376 |     1,253,056 |
      java.util.Hashtable$Entry                                    |      36,921 |     1,181,472 |
      java.util.concurrent.atomic.AtomicLong                       |      39,444 |       946,656 |
      org.apache.hadoop.hive.metastore.api.SerDeInfo               |      22,375 |       895,000 |
      byte[]                                                       |       1,686 |       668,480 |
      java.util.concurrent.ConcurrentHashMap$Node[]                |       1,874 |       639,824 |
      com.codahale.metrics.ExponentiallyDecayingReservoir          |      10,259 |       574,504 |
      java.util.HashMap$Node                                       |      17,776 |       568,832 |
      org.apache.hadoop.hive.metastore.api.SkewedInfo              |      22,375 |       537,000 |
      com.codahale.metrics.Meter                                   |      10,260 |       492,480 |
      java.util.concurrent.ConcurrentSkipListMap                   |      10,260 |       492,480 |
      java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync|      10,259 |       492,432 |
      org.apache.impala.catalog.ColumnStats                        |       5,003 |       400,240 |                 
      Total: 24 of 6,158 entries; 6,130 more                       | 113,007,927 | 3,174,330,288 | 

      In the above case, these FieldSchema instances come from the list of hmsPartitions that is created locally by CatalogOpExecutor#alterTableRecoverPartitions(). The thread is 0x6d051abb8:

      Stacktrace:

      Thread 0x6d051abb8
        at org.apache.hadoop.hive.metastore.api.StorageDescriptor.<init>(Lorg/apache/hadoop/hive/metastore/api/StorageDescriptor;)V (StorageDescriptor.java:216)
        at org.apache.hadoop.hive.metastore.api.StorageDescriptor.deepCopy()Lorg/apache/hadoop/hive/metastore/api/StorageDescriptor; (StorageDescriptor.java:256)
        at org.apache.impala.service.CatalogOpExecutor.createHmsPartitionFromValues(Ljava/util/List;Lorg/apache/hadoop/hive/metastore/api/Table;Lorg/apache/impala/analysis/TableName;Ljava/lang/String;)Lorg/apache/hadoop/hive/metastore/api/Partition; (CatalogOpExecutor.java:5787)
        at org.apache.impala.service.CatalogOpExecutor.alterTableRecoverPartitions(Lorg/apache/impala/catalog/Table;Ljava/lang/String;)V (CatalogOpExecutor.java:5678)
        at org.apache.impala.service.CatalogOpExecutor.alterTable(Lorg/apache/impala/thrift/TAlterTableParams;Ljava/lang/String;ZLorg/apache/impala/thrift/TDdlExecResponse;)V (CatalogOpExecutor.java:1208)
        at org.apache.impala.service.CatalogOpExecutor.execDdlRequest(Lorg/apache/impala/thrift/TDdlExecRequest;)Lorg/apache/impala/thrift/TDdlExecResponse; (CatalogOpExecutor.java:419)
        at org.apache.impala.service.JniCatalog.execDdl([B)[B (JniCatalog.java:260)

      How this happens

      When creating the list of hmsPartitions, we deep-copy the StorageDescriptor, which also deep-copies the column list:

      alterTableRecoverPartitions()
      -> createHmsPartitionFromValues()
         -> StorageDescriptor sd = msTbl.getSd().deepCopy();

      Since Impala doesn't respect partition-level schemas (by design), we should share the list of FieldSchema instances across the hmsPartitions.
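The fix idea can be sketched as follows. This is not Impala's actual code: the classes below are minimal stand-ins for the real org.apache.hadoop.hive.metastore.api thrift classes, and buildPartitionSds is a hypothetical helper. The point is that after the per-partition deepCopy(), each StorageDescriptor is re-pointed at the single table-level column list, so only one List<FieldSchema> exists no matter how many partitions are created:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for org.apache.hadoop.hive.metastore.api.FieldSchema.
class FieldSchema {
  final String name;
  final String type;
  FieldSchema(String name, String type) { this.name = name; this.type = type; }
}

// Minimal stand-in for the HMS StorageDescriptor; the real deepCopy()
// also clones every FieldSchema, modeled here as a per-partition list copy.
class StorageDescriptor {
  List<FieldSchema> cols;
  StorageDescriptor deepCopy() {
    StorageDescriptor copy = new StorageDescriptor();
    copy.cols = new ArrayList<>(cols);  // this copy is the source of the bloat
    return copy;
  }
}

public class SharedSchemaDemo {
  // Build one SD per partition, then point each back at the table-level
  // column list so all partitions share a single List<FieldSchema>.
  static List<StorageDescriptor> buildPartitionSds(StorageDescriptor tableSd,
      int numParts) {
    List<StorageDescriptor> sds = new ArrayList<>();
    for (int p = 0; p < numParts; p++) {
      StorageDescriptor sd = tableSd.deepCopy();
      sd.cols = tableSd.cols;  // share, don't duplicate
      sds.add(sd);
    }
    return sds;
  }

  public static void main(String[] args) {
    StorageDescriptor tableSd = new StorageDescriptor();
    tableSd.cols = new ArrayList<>();
    for (int i = 0; i < 5000; i++) {
      tableSd.cols.add(new FieldSchema("col" + i, "string"));
    }
    List<StorageDescriptor> sds = buildPartitionSds(tableSd, 1000);
    // Every partition SD now references the same list instance.
    System.out.println(sds.get(0).cols == sds.get(999).cols);  // prints "true"
  }
}
```

With 5,000 columns and 50,000 partitions, the per-partition copies would otherwise produce 250 million FieldSchema instances; sharing one list keeps it at 5,000.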

      When loading partition metadata for such a table, we can also hit this issue. The HMS API "get_partitions_by_names" returns the list of hmsPartitions, each of which references a unique list of FieldSchema instances. We should deduplicate these so they share the same column list. FWIW, the heap analysis results for the wide-table-loading OOM (4GB heap) are attached.
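A deduplication pass over partitions returned by HMS could look like the following sketch (ColumnListInterner and dedup are hypothetical names, not Impala's actual code). Structurally equal column lists collapse to one canonical instance, so the redundant copies become garbage-collectable:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ColumnListInterner {
  // Map each column list to a canonical instance. List.equals()/hashCode()
  // are element-wise, so distinct-but-equal lists map to the same key and
  // computeIfAbsent returns the first instance seen for all of them.
  static <T> List<List<T>> dedup(List<List<T>> lists) {
    Map<List<T>, List<T>> canonical = new HashMap<>();
    List<List<T>> out = new ArrayList<>();
    for (List<T> l : lists) {
      out.add(canonical.computeIfAbsent(l, k -> k));
    }
    return out;
  }

  public static void main(String[] args) {
    // Three distinct-but-equal column lists, like those returned per
    // partition by get_partitions_by_names.
    List<List<String>> cols = new ArrayList<>();
    for (int p = 0; p < 3; p++) {
      cols.add(new ArrayList<>(List.of("c1", "c2")));
    }
    List<List<String>> shared = dedup(cols);
    System.out.println(shared.get(0) == shared.get(2));  // prints "true"
  }
}
```

Note that using mutable lists as HashMap keys is only safe because the column lists are effectively frozen once loaded; mutating a key after interning would corrupt the map.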

      The FieldSchema instances come from the metadata loading thread (see the attached wide-table-loading-oom-FieldSchema-path2root.png).

      Attachments

        1. create_ext_tbl_with_5k_cols_50k_parts.sh
          1.0 kB
          Quanlong Huang
        2. MAT_dominator_tree.png
          94 kB
          Quanlong Huang
        3. wide-table-loading-oom-FieldSchema-path2root.png
          72 kB
          Quanlong Huang
        4. wide-table-loading-oom-histogram.png
          172 kB
          Quanlong Huang
        5. wide-table-loading-oom-top-consumers.png
          376 kB
          Quanlong Huang

          Activity

            stigahuang Quanlong Huang added a comment -

            FWIW, uploaded a script create_ext_tbl_with_5k_cols_50k_parts.sh to create such a table.

            stigahuang Quanlong Huang added a comment -

            Uploaded a fix for review: https://gerrit.cloudera.org/c/19391/


            jira-bot ASF subversion and git services added a comment -

            Commit 77d80aeda653b3aecb8bc41bf867cc5a84ba1245 in impala's branch refs/heads/master from stiga-huang
            [ https://gitbox.apache.org/repos/asf?p=impala.git;h=77d80aeda ]

            IMPALA-11812: Deduplicate column schema in hmsPartitions

            A list of HMS Partitions will be created in many workloads in catalogd,
            e.g. table loading, bulk altering partitions by ComputeStats or
            AlterTableRecoverPartitions, etc. Currently, each of hmsPartition hold a
            unique list of column schema, i.e. a List<FieldSchema>. This results in
            lots of FieldSchema instances if the table is wide and lots of
            partitions need to be loaded/operated. Though the strings of column
            names and comments are interned, the FieldSchema objects could still
            occupy the majority of the heap. See the histogram in JIRA description.

            In reality, the hmsPartition instances of a table can share the
            table-level column schema since Impala doesn't respect the partition
            level schema.

            This patch replaces column list in StorageDescriptor of hmsPartitions
            with the table level column list to remove the duplications. Also add
            some progress logs in batch HMS operations, and avoid misleading logs
            when event-processor is disabled.

            Tests:

            • Ran exhaustive tests
            • Add tests on wide table operations that hit OOM errors without this
              fix.

            Change-Id: I511ecca0ace8bea4c24a19a54fb0a75390e50c4d
            Reviewed-on: http://gerrit.cloudera.org:8080/19391
            Reviewed-by: Aman Sinha <amsinha@cloudera.com>
            Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

            stigahuang Quanlong Huang added a comment -

            Resolving this. Thanks to amansinha and sql_forever for the review!


            People

              Assignee: stigahuang Quanlong Huang
              Reporter: stigahuang Quanlong Huang
              Votes: 0
              Watchers: 4
