- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: Impala 2.2.4
- Fix Version/s: None
- Component/s: Catalog
- Labels:
- Target Version:
An initial analysis of the catalog's heap dumps shows that we can probably reduce its memory footprint by: a) avoiding storing redundant information about catalog entities such as partitions, and b) using more compressed data structures.
Currently, for a table with 2 int columns and 1 int partition column and without incremental stats, we use:
- ~930B per partition, of which ~500B are used by hmsParameters_ (a Map<String, String>), ~190B by cachedMsPartitionDescriptor_, and ~200B (depending on the path) by the location.
- ~800B per file descriptor, of which ~530B go to file_blocks and the rest is used for storing the file_name.
- Every HdfsTable also uses two maps that replicate partition locations and file names (e.g. perPartitionFileDescMap_ and nameToPartitionMap_).
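To make the redundancy in the location strings concrete, here is a minimal sketch, not Impala code, of keeping a location only when it differs from what the partition keys already imply. It assumes the Hive-style default layout `<table base>/<key>=<value>` and ignores the escaping of special characters in partition values. The class and function names (`PartitionLocations`, `default_location`) are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def default_location(base: str, keys: Sequence[str], values: Sequence[str]) -> str:
    """Build the Hive-style default path: <base>/<key1>=<val1>/<key2>=<val2>."""
    suffix = "/".join(f"{k}={v}" for k, v in zip(keys, values))
    return f"{base.rstrip('/')}/{suffix}"


class PartitionLocations:
    """Hypothetical store: keeps a string only for non-default locations."""

    def __init__(self, base: str, keys: Sequence[str]) -> None:
        self.base = base
        self.keys = tuple(keys)
        # Partition key values -> explicitly stored (non-default) location.
        self.overrides: Dict[Tuple[str, ...], str] = {}

    def put(self, values: Sequence[str], location: str) -> None:
        key = tuple(values)
        if location == default_location(self.base, self.keys, key):
            # Derivable from the partition keys, so nothing is stored.
            self.overrides.pop(key, None)
        else:
            self.overrides[key] = location

    def get(self, values: Sequence[str]) -> str:
        key = tuple(values)
        override = self.overrides.get(key)
        if override is not None:
            return override
        return default_location(self.base, self.keys, key)


store = PartitionLocations("hdfs://nn/warehouse/t", ["p"])
store.put(["1"], "hdfs://nn/warehouse/t/p=1")  # default layout, not stored
store.put(["2"], "hdfs://other/custom/p2")     # custom location, stored
print(store.get(["1"]), len(store.overrides))
```

With this layout, a table whose partitions all live under the default path stores no per-partition location strings at all. The cost is a string rebuild on each lookup.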
A table with this schema, 100,000 partitions, and 10 files per partition requires roughly 1GB of memory without incremental stats and 1.4GB with them.
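The figure without incremental stats can be roughly reproduced from the per-partition and per-file-descriptor sizes quoted above. The following back-of-the-envelope sketch uses only those approximate numbers. It leaves out the two per-table maps and any incremental stats, so it is a lower bound. The function name is illustrative.

```python
PARTITION_BYTES = 930   # approximate size per partition, from the heap dump analysis
FILE_DESC_BYTES = 800   # approximate size per file descriptor


def estimate_catalog_bytes(num_partitions: int, files_per_partition: int) -> int:
    """Estimate the heap used by partitions and file descriptors of one table."""
    partition_bytes = num_partitions * PARTITION_BYTES
    file_bytes = num_partitions * files_per_partition * FILE_DESC_BYTES
    return partition_bytes + file_bytes


total = estimate_catalog_bytes(100_000, 10)
print(f"{total / 1e9:.2f} GB")  # prints "0.89 GB"
```

About 90% of the estimate comes from file descriptors (800MB of the 893MB), which suggests that the file metadata is where most of the savings would be.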
This is a parent JIRA of IMPALA-2840.
- is a child of: IMPALA-5299 Improve catalog scalability and large catalog handling (Open)
- relates to: IMPALA-5990 End-to-end compression of metadata (Resolved)
1. Avoid storing redundant information about partitions in the catalog | Open | Unassigned
2. Store partition location info with respect to partition keys | Open | Unassigned
3. Prefer binary over string in catalog thrift definitions | Open | Tianyi Wang
4. Reduce working memory when processing metadata cache updates | Open | Unassigned