[HIVE-15580] Eliminate unbounded memory usage for orderBy and groupBy in Hive on Spark - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.3.0
Component/s: Spark
Labels:
None

Description

Currently, orderBy (sortBy) and groupBy in Hive on Spark use unbounded memory. For orderBy, Hive accumulates each key group in an ArrayList (described in HIVE-15527). For groupBy, Hive currently uses Spark's groupByKey operator, which cannot spill to disk within a key group. Thus, for a large key group, memory usage is also unbounded.

This change is likely to impact performance. We will profile and optimize afterwards. We could also make the change configurable.
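To illustrate why buffering a whole key group is the problem, here is a minimal sketch in plain Java (no Spark dependency) of the general alternative: if rows arrive already sorted by key, an aggregate can be computed in a single pass while holding only the current key and a running value. The class `StreamingGroupSketch`, the method `countPerKey`, and the use of a count as the aggregate are all hypothetical names and choices made for illustration; this is not the code from the attached patches.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

public class StreamingGroupSketch {

    /**
     * Counts rows per key from an iterator that is assumed to be sorted by key.
     * Only the current key and its running count are held in memory, so the
     * footprint does not grow with the size of a key group.
     */
    public static void countPerKey(Iterator<Map.Entry<String, Integer>> sortedRows,
                                   BiConsumer<String, Long> emit) {
        String currentKey = null;
        long count = 0;
        boolean started = false;
        while (sortedRows.hasNext()) {
            Map.Entry<String, Integer> row = sortedRows.next();
            // A key change marks the end of the previous group: emit and reset.
            if (started && !row.getKey().equals(currentKey)) {
                emit.accept(currentKey, count);
                count = 0;
            }
            currentKey = row.getKey();
            started = true;
            count++;
        }
        // Emit the final group, if any rows were seen at all.
        if (started) {
            emit.accept(currentKey, count);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input, already sorted by key.
        List<Map.Entry<String, Integer>> rows = List.of(
                Map.entry("a", 1), Map.entry("a", 2), Map.entry("b", 3));
        countPerKey(rows.iterator(), (key, n) -> System.out.println(key + " -> " + n));
    }
}
```

By contrast, collecting every value of a key into an ArrayList before processing it needs memory proportional to the largest group. The streaming form relies on the input being sorted by key, which costs a sort up front, consistent with the performance concern noted above.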

Attachments

HIVE-15580.patch, 11/Jan/17 06:03, 5 kB, Xuefu Zhang
HIVE-15580.1.patch, 11/Jan/17 17:28, 8 kB, Xuefu Zhang
HIVE-15580.1.patch, 11/Jan/17 20:19, 8 kB, Xuefu Zhang
HIVE-15580.2.patch, 13/Jan/17 13:53, 15 kB, Xuefu Zhang
HIVE-15580.2.patch, 18/Jan/17 15:46, 15 kB, Xuefu Zhang
HIVE-15580.3.patch, 18/Jan/17 17:46, 21 kB, Xuefu Zhang
HIVE-15580.4.patch, 19/Jan/17 05:58, 21 kB, Xuefu Zhang
HIVE-15580.5.patch, 19/Jan/17 13:33, 37 kB, Xuefu Zhang

Issue Links

incorporates

HIVE-15527 Memory usage is unbound in SortByShuffler for Spark

Resolved

relates to

HIVE-15682 Eliminate per-row based dummy iterator creation

Resolved

HIVE-15683 Make what's done in HIVE-15580 for group by configurable

Resolved

People

Assignee: Xuefu Zhang

Reporter: Xuefu Zhang

Votes: 0

Watchers: 8

Dates

Created: 11/Jan/17 06:02

Updated: 21/Jul/17 18:36

Resolved: 20/Jan/17 20:57