[SPARK-24650] GroupingSet - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.3.1
Fix Version/s: None
Component/s: SQL
Labels:
Environment: CDH 5.X, Spark 2.3
Flags: Patch, Important

Description

If a grouping set is used in Spark SQL, the resulting plan does not perform optimally.

If the input to a grouping set is x rows and the grouping set contains y groups, then the number of rows processed is currently x*y.

Example: let a DataFrame have columns col1, col2, col3 and col4, and let the number of rows be rowNo,

and let the grouping sets consist of: (1) col1, col2, col3 (2) col2, col4 (3) col1, col2

The number of rows processed in this case is 3 * rowNo, so the volume of data processed is 3 * (rowNo * size of each row).
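
To make the x*y arithmetic concrete, here is a minimal plain-Python sketch of the row duplication described above. It does not use Spark. The `expand` function and the integer `grouping_id` are hypothetical simplifications for illustration, assuming every input row is copied once per grouping set with the grouping columns outside that set set to null.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[int]]


def expand(rows: List[Row],
           grouping_sets: Sequence[Sequence[str]],
           all_group_cols: Sequence[str]) -> List[Row]:
    """Emit one copy of every input row per grouping set.

    Grouping columns that are not part of the current set are nulled.
    `grouping_id` is just the index of the set (an assumption made
    for this sketch, not Spark's actual encoding).
    """
    out: List[Row] = []
    for gid, gset in enumerate(grouping_sets):
        for row in rows:
            copy: Row = {c: (row[c] if c in gset else None) for c in all_group_cols}
            copy["grouping_id"] = gid
            out.append(copy)
    return out


# Made-up sample data: rowNo = 10 rows with col1..col4.
rows = [{"col1": i % 2, "col2": i % 3, "col3": i % 5, "col4": i % 7} for i in range(10)]
grouping_sets = [("col1", "col2", "col3"), ("col2", "col4"), ("col1", "col2")]

expanded = expand(rows, grouping_sets, ["col1", "col2", "col3", "col4"])
print(len(rows), len(expanded))
```

With three grouping sets, the expanded list holds three times as many rows as the input, and every copy still carries all four grouping columns even when most of them are null.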

However, is this the optimal way of processing the data?

If the y groups are derivable from each other, can we reduce the volume of data processed by removing columns as we progress to the lower dimensions of processing?
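
One way to read "derivable" is that a coarser group can be computed from the result of a finer one instead of from the raw rows. The plain-Python sketch below illustrates that idea for a sum. The `aggregate` and `as_map` helpers and the `amount` column are hypothetical. Note the limits of the sketch: it only covers aggregates that can be re-aggregated (sum, count, min, max), exact percentiles cannot be combined this way, and in the example from this issue only set (3) col1, col2 is derivable from set (1), since set (2) needs col4.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def aggregate(rows: List[Dict[str, int]],
              keys: Sequence[str],
              value: str) -> List[Dict[str, int]]:
    """Group `rows` by `keys` and sum the `value` column."""
    totals: Dict[Tuple[int, ...], int] = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[value]
    return [dict(zip(keys, k), **{value: v}) for k, v in totals.items()]


def as_map(result: List[Dict[str, int]],
           keys: Sequence[str],
           value: str) -> Dict[Tuple[int, ...], int]:
    """Index an aggregate result by its key tuple for easy comparison."""
    return {tuple(r[k] for k in keys): r[value] for r in result}


# Made-up sample data with a hypothetical numeric column `amount`.
rows = [{"col1": i % 2, "col2": i % 3, "col3": i % 5, "amount": i} for i in range(300)]

# Finest group first: (col1, col2, col3).
fine = aggregate(rows, ("col1", "col2", "col3"), "amount")

# Coarser group (col1, col2), derived from the already reduced result
# rather than from the 300 raw rows.
coarse_from_fine = aggregate(fine, ("col1", "col2"), "amount")
coarse_direct = aggregate(rows, ("col1", "col2"), "amount")

print(len(rows), len(fine), len(coarse_from_fine))
```

Here the second aggregation reads the rows of `fine` rather than the raw input, and it produces the same sums as aggregating the raw rows directly.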

Currently, while computing percentiles, a lot of data seems to be processed, causing performance issues.

We need to look into whether this can be optimised.

People

Assignee: Unassigned

Reporter: Mihir Sahu

Votes: 0

Watchers: 4

Dates

Created: 25/Jun/18 17:51

Updated: 12/Dec/22 18:10

Resolved: 08/Oct/19 05:44