[IMPALA-3173] Reduce catalog's memory footprint - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 2.2.4
Fix Version/s: None
Component/s: Catalog
Labels:

Target Version: Product Backlog

Description

An initial analysis of the catalog's heap dumps shows that we can probably reduce its memory footprint by: a) avoiding storing redundant information about catalog entities such as partitions, and b) using more compressed data structures.

Currently, for a table with 2 int columns and 1 int partition column and without incremental stats, we use:

~930B per partition, of which ~500B are used by hmsParameters_ (a Map<String, String>), ~190B by cachedMsPartitionDescriptor_, and ~200B (depending on the path) by the location.
~800B per file descriptor, of which ~530B go to file_blocks and the rest is used for storing the file_name.
Every HdfsTable also uses two maps that replicate partition locations and file names (perPartitionFileDescMap_ and nameToPartitionMap_).

A table like that with 100,000 partitions and 10 files per partition requires 1GB and 1.4GB of memory with and without incremental stats, respectively.
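
As a rough cross-check, the per-partition and per-file-descriptor figures above can be multiplied out for that example table. The sketch below uses only the ~930B and ~800B numbers quoted in this description; the function and constant names are made up for illustration, and the two replicated per-table maps and any incremental stats are not modeled, so the result is a lower bound rather than a measurement.

```python
# Back-of-the-envelope estimate from the per-entity sizes quoted in this issue.
# Names are illustrative only; they do not exist in the Impala code base.
BYTES_PER_PARTITION = 930        # ~930B per partition (no incremental stats)
BYTES_PER_FILE_DESCRIPTOR = 800  # ~800B per file descriptor


def estimate_catalog_bytes(num_partitions: int, files_per_partition: int) -> int:
    """Estimate heap used by partition and file-descriptor metadata for one table."""
    partition_bytes = num_partitions * BYTES_PER_PARTITION
    file_bytes = num_partitions * files_per_partition * BYTES_PER_FILE_DESCRIPTOR
    return partition_bytes + file_bytes


if __name__ == "__main__":
    total = estimate_catalog_bytes(num_partitions=100_000, files_per_partition=10)
    print(f"{total / 1e9:.2f} GB")  # prints "0.89 GB"
```

The file descriptors account for about 800MB of that total and the partitions for about 93MB, which suggests that file-descriptor size dominates for tables with many files per partition.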

This is a parent JIRA of IMPALA-2840.

Issue Links

is a child of: IMPALA-5299 Improve catalog scalability and large catalog handling (Open)

relates to: IMPALA-5990 End-to-end compression of metadata (Resolved)

Sub-Tasks

1. Avoid storing redundant information about partitions in the catalog (Open, Unassigned)
2. Store partition location info with respect to partition keys (Open, Unassigned)
3. Optimize HdfsTable::perPartitionFileDescMap_ to reduce memory usage (Resolved, Unassigned)
4. Consider using listLocatedStatus() API to get filestatus and blocklocations in one RPC call (Resolved, Bharath Vissapragada)
5. Reduce memory requirements for storing THdfsFileDesc (Resolved, Dimitris Tsirogiannis)
6. Prefer binary over string in catalog thrift definitions (Open, Tianyi Wang)
7. Reduce working memory when processing metadata cache updates (Open, Unassigned)
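
To illustrate the idea behind sub-task 2, here is a minimal sketch of storing a partition location only when it cannot be derived from the table location and the partition keys. It assumes the common Hive-style directory layout (`key=value` path components) and ignores the escaping of special characters in partition values; all function names are hypothetical, and this is not how Impala implements or plans to implement the sub-task.

```python
from typing import List, Optional, Tuple

PartitionKeys = List[Tuple[str, str]]  # e.g. [("year", "2016"), ("month", "1")]


def default_location(table_location: str, keys: PartitionKeys) -> str:
    """Location implied by the table path plus key=value directories (assumed layout)."""
    suffix = "/".join(f"{name}={value}" for name, value in keys)
    return table_location.rstrip("/") + "/" + suffix


def encode_location(table_location: str, keys: PartitionKeys,
                    partition_location: str) -> Optional[str]:
    """Return None when the location is derivable from the keys, else the full path."""
    if partition_location.rstrip("/") == default_location(table_location, keys):
        return None  # nothing to store: ~200B per partition saved in the common case
    return partition_location


def decode_location(table_location: str, keys: PartitionKeys,
                    stored: Optional[str]) -> str:
    """Rebuild the partition location from what was stored (trailing '/' is not kept)."""
    return default_location(table_location, keys) if stored is None else stored
```

With this scheme only partitions placed at a custom location pay for a stored path; every other partition location is recomputed on demand from data the catalog already holds.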