[SPARK-22408] RelationalGroupedDataset's distinct pivot value calculation launches unnecessary stages - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Trivial
Resolution: Fixed
Affects Version/s: 2.2.0
Fix Version/s: 2.3.0
Component/s: SQL
Labels: None

Description

When calculating the distinct values for a pivot in RelationalGroupedDataset (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L322), we sort before doing a take(maxValues + 1).

We should be able to improve this in two ways: by adding a global limit before the sort, which should reduce the work the sort has to do, and by simply doing a collect instead of the take, which avoids launching multiple stages.
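To illustrate why limiting before sorting is safe here, the following is a minimal sketch in plain Python rather than Spark code. It assumes that the caller only needs to know whether more than maxValues distinct values exist, and otherwise wants all of them in sorted order. The function name `pivot_values` and its error message are hypothetical and are not taken from the Spark source.

```python
from itertools import islice


def pivot_values(column, max_values):
    """Return the sorted distinct values of `column`.

    Raises ValueError if `column` holds more than `max_values` distinct values.
    Hypothetical helper that models the proposed ordering of operations
    (distinct, then limit, then sort) with plain Python collections.
    """

    def distinct(values):
        # Yield each value the first time it is seen.
        seen = set()
        for value in values:
            if value not in seen:
                seen.add(value)
                yield value

    # Limit first: at most max_values + 1 distinct values ever reach the sort.
    limited = list(islice(distinct(column), max_values + 1))

    # One extra value is enough to detect that the threshold was exceeded.
    if len(limited) > max_values:
        raise ValueError(
            f"more than {max_values} distinct pivot values (illustrative message)"
        )

    # Below the threshold nothing was cut off, so sorting the limited
    # values gives the same result as sorting all distinct values.
    return sorted(limited)
```

When the number of distinct values is at or below the threshold, the limit drops nothing, so the result matches sorting first. When it is above the threshold, the extra value triggers the error and the order no longer matters.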

Issue Links

links to

[Github] Pull Request #19629 (pwoody)

People

Assignee: Patrick Woody

Reporter: Patrick Woody

Votes: 0

Watchers: 2

Dates

Created: 01/Nov/17 15:42

Updated: 02/Nov/17 13:20

Resolved: 02/Nov/17 13:20