Details
Improvement
Status: Resolved
Major
Resolution: Incomplete
2.3.1
None
CDH 5.X, Spark 2.3
Patch, Important
Description
If grouping sets are used in Spark SQL, the resulting plan does not perform optimally.
If the input to a grouping-set query has x rows and there are y grouping sets, the number of rows processed is currently x*y.
Example: let a DataFrame have columns col1, col2, col3 and col4, and let the number of rows be rowNo.
Let the grouping sets be: (1) col1, col2, col3 (2) col2, col4 (3) col1, col2
The number of rows processed in this case is 3 * rowNo, i.e. a data volume of 3 * (rowNo * size of each row).
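To make the x*y figure concrete, here is a minimal pure-Python sketch of the expansion described above. It is a simplified model of the behaviour, not Spark's actual code: the `expand` helper, the `grouping_id` field, and the sample data are all hypothetical and exist only for illustration.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[int]]

COLUMNS = ["col1", "col2", "col3", "col4"]

# The three grouping sets from the example above.
GROUPING_SETS: List[Tuple[str, ...]] = [
    ("col1", "col2", "col3"),
    ("col2", "col4"),
    ("col1", "col2"),
]


def expand(rows: List[Row], grouping_sets: List[Tuple[str, ...]]) -> List[Row]:
    """Emit one copy of every input row per grouping set.

    Columns that a grouping set does not group by are set to None,
    and a hypothetical grouping_id records which set the copy belongs to.
    """
    out: List[Row] = []
    for row in rows:
        for gid, gset in enumerate(grouping_sets):
            copy: Row = {c: (row[c] if c in gset else None) for c in COLUMNS}
            copy["grouping_id"] = gid
            out.append(copy)
    return out


# Hypothetical sample data: 10 input rows.
rows: List[Row] = [
    {"col1": i % 2, "col2": i % 3, "col3": i % 5, "col4": i} for i in range(10)
]
expanded = expand(rows, GROUPING_SETS)

# Every input row appears once per grouping set.
print(len(rows), len(expanded))
```

With rowNo input rows and 3 grouping sets, the downstream aggregation in this model sees 3 * rowNo rows, which matches the count given in the example.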
However, is this the optimal way of processing the data?
If some of the y grouping sets are derivable from each other, can we reduce the volume of data processed by dropping columns as we progress to the lower-dimensional groupings?
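The idea of deriving one grouping from another can be sketched as follows. This is an illustration in plain Python under stated assumptions, not a proposed Spark implementation: the `aggregate` and `to_map` helpers and the `cnt` field are hypothetical, and a simple row count stands in for the aggregate. The (col1, col2) grouping is computed from the already-aggregated (col1, col2, col3) result instead of from the raw input.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def aggregate(rows: List[dict], keys: Sequence[str]) -> Tuple[List[dict], int]:
    """Count rows per distinct combination of `keys`.

    Raw rows contribute 1 each; pre-aggregated rows contribute their
    existing `cnt`. Returns the aggregated rows and how many rows were read.
    """
    counts: Counter = Counter()
    for row in rows:
        counts[tuple(row[k] for k in keys)] += row.get("cnt", 1)
    result = [dict(zip(keys, key), cnt=n) for key, n in counts.items()]
    return result, len(rows)


def to_map(rows: List[dict], keys: Sequence[str]) -> Dict[tuple, int]:
    """Index aggregated rows by their key tuple for easy comparison."""
    return {tuple(r[k] for k in keys): r["cnt"] for r in rows}


# Hypothetical sample data: 1000 input rows.
raw = [{"col1": i % 2, "col2": i % 3, "col3": i % 5, "col4": i} for i in range(1000)]

# Finest grouping: must read the raw input.
fine, read_fine = aggregate(raw, ("col1", "col2", "col3"))

# Coarser grouping derived from the finer result, not from the raw input.
derived, read_derived = aggregate(fine, ("col1", "col2"))

# Reference: the same coarser grouping computed directly from the raw input.
direct, read_direct = aggregate(raw, ("col1", "col2"))

print(read_fine, read_derived, read_direct)
print(to_map(derived, ("col1", "col2")) == to_map(direct, ("col1", "col2")))
```

In this sketch the derived pass reads only the rows of the finer result rather than every raw row. Two caveats apply: the (col2, col4) grouping set cannot be derived from the other two because neither of them retains col4, and this composition works for counts and sums but not for exact percentiles, which cannot be recomputed from per-group percentile values alone.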
Currently, when computing percentiles, a lot of data seems to be processed, which causes performance issues.
We need to look into whether this can be optimised.