  Spark / SPARK-22408

RelationalGroupedDataset's distinct pivot value calculation launches unnecessary stages


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.3.0
    • Component/s: SQL
    • Labels: None

    Description

      When calculating the distinct values for a pivot in RelationalGroupedDataset (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L322), we sort before doing a take(maxValues + 1).

      We should be able to improve this in two ways: by adding a global limit before the sort, which reduces the number of rows the sort has to process, and by doing a plain collect instead of a take, which avoids launching multiple stages.
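
      The proposal above can be sketched roughly as follows. This is a minimal illustration, not the code in RelationalGroupedDataset: the object PivotValuesSketch, the helper distinctPivotValues, and the sample data are hypothetical names introduced here, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch of the idea in this issue; not Spark's actual implementation.
object PivotValuesSketch {

  // Returns the distinct values of `pivotColumn`, capped at maxValues + 1 rows.
  // The limit is applied BEFORE the sort, so the sort sees at most
  // maxValues + 1 rows, and collect() fetches the result in a single job
  // instead of the incremental partition scans that take() may launch.
  def distinctPivotValues(df: DataFrame, pivotColumn: String, maxValues: Int): Seq[Any] =
    df.select(pivotColumn)
      .distinct()
      .limit(maxValues + 1) // global limit ahead of the sort
      .sort(pivotColumn)    // keep the output in a consistent order
      .collect()
      .map(_.get(0))
      .toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("pivot-values-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data for illustration only.
    val df = Seq(("a", 2017, 1), ("b", 2016, 2), ("c", 2017, 3))
      .toDF("key", "year", "value")

    // Contains 2016 and 2017, in ascending order.
    val values = distinctPivotValues(df, "year", maxValues = 10)
    println(values)

    spark.stop()
  }
}
```

      One consequence of limiting before sorting: if there are more than maxValues distinct values, the rows that survive the limit are an arbitrary subset rather than the smallest ones. That should only matter if the caller goes on to use the truncated result; a caller that treats a result longer than maxValues as an error is unaffected.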


          People

            Assignee: Patrick Woody (pwoody)
            Reporter: Patrick Woody (pwoody)
            Votes: 0
            Watchers: 2
