[SPARK-28128] Pandas Grouped UDFs should skip over empty partitions - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.3
Fix Version/s: 3.0.0
Component/s: PySpark, SQL
Labels: None

Description

When FlatMapGroupsInPandasExec or AggregateInPandasExec runs, the preceding shuffle uses the number of partitions set by "spark.sql.shuffle.partitions", which defaults to 200. If the data is small, e.g. in testing, many of these partitions will be empty, yet they are processed the same as non-empty ones. For example, ArrowPythonRunner.compute is still called and starts a number of threads that do nothing because there are no rows to iterate over. Skipping this computation for empty partitions would save time overall.
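
As a rough illustration of the idea, the following Python sketch models a partition as a plain iterable and peeks at the first element before starting any work. It is not Spark's implementation: CountingRunner and run_partition are hypothetical names invented here, and CountingRunner.compute only stands in for the startup cost that the description attributes to ArrowPythonRunner.compute.

```python
import itertools
from typing import Iterable, Iterator


class CountingRunner:
    """Hypothetical stand-in for an expensive per-partition runner."""

    def __init__(self) -> None:
        self.starts = 0  # how many times the costly startup was paid

    def compute(self, rows: Iterator[int]) -> Iterator[int]:
        # Models the startup work (threads, worker handshake) that happens
        # as soon as compute is called, before any row is consumed.
        self.starts += 1
        return (row * 2 for row in rows)


def run_partition(rows: Iterable[int], runner: CountingRunner) -> Iterator[int]:
    """Run the runner over one partition, skipping it entirely when empty."""
    it = iter(rows)
    try:
        first = next(it)  # peek without materializing the whole partition
    except StopIteration:
        return iter(())  # empty partition: never touch the runner
    # Put the peeked row back in front of the remaining rows.
    return runner.compute(itertools.chain([first], it))


# Mostly empty partitions, as with a small dataset and many shuffle partitions.
partitions = [[], [1, 2], [], [], [3]]
runner = CountingRunner()
results = [list(run_partition(p, runner)) for p in partitions]
```

In this toy run the runner is started only for the two non-empty partitions, while the three empty ones return an empty iterator immediately. Peeking at a single element keeps the check cheap, since the partition is never collected into memory just to test whether it is empty.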

Issue Links

links to

GitHub Pull Request #24926

People

Assignee: Bryan Cutler

Reporter: Bryan Cutler

Votes: 0

Watchers: 1

Dates

Created: 21/Jun/19 01:21

Updated: 12/Dec/22 18:10

Resolved: 22/Jun/19 02:21