  1. Spark
  2. SPARK-24650

GroupingSet

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: SQL
    • Environment: CDH 5.X, Spark 2.3

    • Flags: Patch, Important

    Description

      If a grouping set is used in Spark SQL, the resulting plan does not perform optimally.

      If the input to a grouping set has x rows and there are y grouping sets, then the number of rows processed is currently x*y.

      Example: let a DataFrame have columns col1, col2, col3 and col4, and let the number of rows be rowNo.

      Let the grouping sets be: (1) col1, col2, col3 (2) col2, col4 (3) col1, col2

      The amount of data processed in this case is 3 * (rowNo * size of each row).
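
      The row count above can be illustrated with a toy model in plain Python. This is only a sketch of the expansion the description reports, not Spark's implementation: the `expand` helper, the sample rows and the `grouping_id` index are all hypothetical and exist only for illustration.

```python
# Toy model: each input row is copied once per grouping set, and the
# grouping columns that are not part of that set are nulled out.
# `expand`, the sample data and `grouping_id` are illustrative assumptions.

def expand(rows, grouping_sets, group_cols):
    """Return one copy of every row per grouping set."""
    out = []
    for row in rows:
        for gid, gset in enumerate(grouping_sets):
            copy = dict(row)
            for col in group_cols:
                if col not in gset:
                    copy[col] = None  # column not grouped on in this set
            copy["grouping_id"] = gid  # simple index, not a bitmask
            out.append(copy)
    return out


rows = [
    {"col1": "a", "col2": "x", "col3": 1, "col4": 10},
    {"col1": "a", "col2": "y", "col3": 2, "col4": 20},
    {"col1": "b", "col2": "x", "col3": 1, "col4": 10},
    {"col1": "b", "col2": "x", "col3": 3, "col4": 30},
]
grouping_sets = [
    ("col1", "col2", "col3"),
    ("col2", "col4"),
    ("col1", "col2"),
]
group_cols = ["col1", "col2", "col3", "col4"]

expanded = expand(rows, grouping_sets, group_cols)
print(len(rows), len(grouping_sets), len(expanded))
```

      With 4 input rows and 3 grouping sets, the model hands 4 * 3 rows to the aggregation. Every copy still carries all four columns, which is why the volume scales with the full row size.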

      However, is this the optimal way of processing the data?

      If some of the y grouping sets are derivable from each other, can we reduce the volume of data processed by dropping columns as we progress to the lower-dimensional sets?
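
      As a rough sketch of what "derivable" could mean, the plain-Python example below computes a sum for (col1, col2, col3) and then rolls that result up to (col1, col2) without rescanning the input. The helpers `aggregate_sum` and `roll_up` and the `amount` column are hypothetical, and the approach assumes a decomposable aggregate such as a sum or count.

```python
from collections import defaultdict

# Sketch under assumptions: `aggregate_sum`, `roll_up` and the `amount`
# column are hypothetical names used only for this illustration.

def aggregate_sum(rows, keys, value):
    """Sum `value` per distinct combination of `keys`, scanning all rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[value]
    return dict(totals)


def roll_up(fine, fine_keys, coarse_keys):
    """Derive the coarser aggregate from the finer one (sums only)."""
    idx = [fine_keys.index(k) for k in coarse_keys]
    totals = defaultdict(int)
    for key, total in fine.items():
        totals[tuple(key[i] for i in idx)] += total
    return dict(totals)


rows = [
    {"col1": "a", "col2": "x", "col3": 1, "amount": 5},
    {"col1": "a", "col2": "x", "col3": 2, "amount": 7},
    {"col1": "a", "col2": "x", "col3": 2, "amount": 1},
    {"col1": "b", "col2": "y", "col3": 1, "amount": 4},
]
fine_keys = ["col1", "col2", "col3"]
coarse_keys = ["col1", "col2"]

fine = aggregate_sum(rows, fine_keys, "amount")    # scans all input rows
derived = roll_up(fine, fine_keys, coarse_keys)    # scans only the fine result
direct = aggregate_sum(rows, coarse_keys, "amount")
print(derived == direct)
```

      In the example from this issue, only set (3) col1, col2 is a subset of set (1); set (2) contains col4, so it cannot be derived from set (1) this way. An exact percentile is also not decomposable like a sum, so this shortcut would not apply to it directly.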

      Currently, when computing percentiles, a lot of data seems to be processed, which causes performance issues.

      We need to check whether this can be optimised.

          People

            Assignee: Unassigned
            Reporter: Mihir Sahu
            Votes: 0
            Watchers: 4
