HIVE-535: Memory-efficient hash-based Aggregation


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.4.0
    • Fix Version/s: None
    • Component/s: Query Processor

    Description

      Currently there is a lot of memory overhead in the hash-based aggregation in GroupByOperator.
      The net result is that GroupByOperator cannot store many entries in its HashTable, flushes frequently, and therefore achieves poor partial aggregation results.

      Here are some initial thoughts (some of them are from Joydeep long time ago):

      A1. Serialize the key of the HashTable. This will eliminate the 16-byte per-object overhead of Java in keys (depending on how many objects there are in the key, the saving can be substantial).
      A2. Use more memory-efficient hash tables - java.util.HashMap has about 64 bytes of overhead per entry.
      A3. Use primitive arrays to store aggregation results. Basically, the UDAF should manage the array of aggregation results, so UDAFCount should manage a long[], and UDAFAvg should manage a double[] and a long[]. The external code should pass an index to iterate/merge/terminate an aggregation result. This will eliminate the 16-byte per-object overhead of Java.
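      To illustrate idea A3, here is a minimal sketch of a count aggregator that owns a primitive long[] of slots instead of one boxed aggregation object per group, with callers passing a slot index to iterate/merge/terminate. Class and method names (PrimitiveCountUDAF, newSlot, etc.) are hypothetical, not Hive APIs:

```java
import java.util.Arrays;

// Hypothetical sketch: the aggregator manages a primitive array of
// per-group counts, eliminating the per-object overhead of one boxed
// aggregation buffer per hash table entry.
public class PrimitiveCountUDAF {
    private long[] counts = new long[16]; // one slot per group
    private int size = 0;

    // Allocate a new aggregation slot and return its index.
    public int newSlot() {
        if (size == counts.length) {
            counts = Arrays.copyOf(counts, size * 2); // grow geometrically
        }
        counts[size] = 0L;
        return size++;
    }

    // iterate: fold one input row into the slot at the given index.
    public void iterate(int idx) {
        counts[idx]++;
    }

    // merge: combine a partial count (from another mapper) into the slot.
    public void merge(int idx, long partial) {
        counts[idx] += partial;
    }

    // terminate: read out the final value for the slot.
    public long terminate(int idx) {
        return counts[idx];
    }
}
```

      An average aggregator would follow the same pattern with a parallel double[] for sums and a long[] for counts, sharing one index per group.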

      More ideas are welcome.

      People

        Assignee: Unassigned
        Reporter: Zheng Shao (zshao)
        Votes: 0
        Watchers: 8
