  1. Spark
  2. SPARK-22408

RelationalGroupedDataset's distinct pivot value calculation launches unnecessary stages

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels:
      None

      Description

      When calculating the distinct values for a pivot in RelationalGroupedDataset (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L322), we sort before doing a take(maxValues + 1).

      We should be able to improve this by adding a global limit before the sort, which reduces the amount of data the sort has to process, and by doing a collect instead of a take, which avoids launching multiple stages.
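
      The proposed change can be sketched against the public Dataset API as follows. This is a minimal illustration, not the actual patch: the object name, the two helper functions, and the sample data are hypothetical, and it assumes a local SparkSession with Spark SQL on the classpath. When the number of distinct values does not exceed maxValues, both helpers should return the same sorted values; the difference is that the limit caps the sort input at maxValues + 1 rows and collect fetches the result in one job, whereas RDD.take may scan partitions incrementally.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical names throughout; only the Dataset/RDD calls are real Spark API.
object PivotValuesSketch {

  // Shape described in the issue: sort every distinct value, then take on the RDD.
  // RDD.take can run more than one job as it widens its partition scan.
  def distinctValuesBefore(df: DataFrame, pivotColumn: String, maxValues: Int): Seq[Any] =
    df.select(pivotColumn)
      .distinct()
      .sort(pivotColumn)
      .rdd
      .map(_.get(0))
      .take(maxValues + 1)
      .toSeq

  // Proposed shape: apply a global limit first so the sort sees at most
  // maxValues + 1 rows, then collect the result directly.
  def distinctValuesAfter(df: DataFrame, pivotColumn: String, maxValues: Int): Seq[Any] =
    df.select(pivotColumn)
      .distinct()
      .limit(maxValues + 1)
      .sort(pivotColumn)
      .collect()
      .map(_.get(0))
      .toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-values-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data for illustration only.
    val df = Seq(("a", 2017, 1), ("b", 2016, 2), ("a", 2016, 3))
      .toDF("key", "year", "value")

    println(distinctValuesBefore(df, "year", maxValues = 10000))
    println(distinctValuesAfter(df, "year", maxValues = 10000))

    spark.stop()
  }
}
```

      If there are more than maxValues distinct values, the limit keeps an arbitrary subset before sorting, so the two helpers can return different rows in that case. Both still return maxValues + 1 values, which is enough for a caller to detect that the cap was exceeded.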

            People

            • Assignee:
              pwoody Patrick Woody
              Reporter:
              pwoody Patrick Woody
            • Votes:
              0
              Watchers:
              2