An initial analysis of the catalog's heap dumps shows that we can probably reduce its memory footprint by: a) avoiding storing redundant information about catalog entities such as partitions, and b) using more compact data structures.
Currently, for a table with 2 int columns and 1 int partition column and without incremental stats, we use:
- ~930B per partition, of which ~500B go to hmsParameters_ (a Map<String, String>), ~190B to cachedMsPartitionDescriptor_, and ~200B (depending on path length) to the location.
- ~800B per file descriptor, of which ~530B go to file_blocks and the rest store the file_name.
- Every HdfsTable also uses two maps that replicate partition locations and file names (e.g. perPartitionFileDescMap_ and nameToPartitionMap_).
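To illustrate point (a), many partitions of the same table carry identical hmsParameters_ maps, so one canonical copy could be shared instead of storing a map per partition. A minimal sketch of that interning idea (the catalog itself is Java; the class and names here are hypothetical, for illustration only):

```python
class ParamInterner:
    """Deduplicate identical parameter maps across partitions.

    Hypothetical flyweight sketch: partitions holding the same
    key/value parameters all reference one shared copy instead of
    each keeping their own ~500B map.
    """

    def __init__(self):
        self._cache = {}

    def intern(self, params):
        # Freeze the map into a canonical, hashable key.
        key = tuple(sorted(params.items()))
        if key not in self._cache:
            self._cache[key] = dict(params)
        return self._cache[key]


interner = ParamInterner()
p1 = interner.intern({"transient_lastDdlTime": "1466531412"})
p2 = interner.intern({"transient_lastDdlTime": "1466531412"})
assert p1 is p2  # both partitions now share one map object
```

The same trick applies to other frequently repeated strings (locations, parameter values), at the cost of one lookup on insert.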
A table like this, with 100,000 partitions and 10 files per partition, requires ~1GB and ~1.4GB of memory without and with incremental stats, respectively.
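The ~1GB figure is roughly consistent with the per-entity sizes measured above; a back-of-the-envelope check (ignoring table-level and map overheads):

```python
# Rough memory estimate from the per-entity sizes above.
PARTITIONS = 100_000
FILES_PER_PARTITION = 10
BYTES_PER_PARTITION = 930   # ~930B per partition (no incremental stats)
BYTES_PER_FILE_DESC = 800   # ~800B per file descriptor

total_bytes = (PARTITIONS * BYTES_PER_PARTITION
               + PARTITIONS * FILES_PER_PARTITION * BYTES_PER_FILE_DESC)
print(f"{total_bytes / 1e9:.2f} GB")  # 0.89 GB, close to the ~1GB measured
```

File descriptors dominate (0.8GB of the 0.89GB), which is why shrinking file_blocks and file_name storage has the largest payoff.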
This is a parent JIRA of IMPALA-2840.
Sub-tasks:
||Summary||Status||Assignee||
|Avoid storing redundant information about partitions in the catalog|Open|Unassigned|
|Store partition location info with respect to partition keys|Open|Unassigned|
|Prefer binary over string in catalog thrift definitions|Open| |
|Reduce working memory when processing metadata cache updates|Open|Unassigned|