Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-28128

Pandas Grouped UDFs should skip over empty partitions

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.3
    • Fix Version/s: 3.0.0
    • Component/s: PySpark, SQL
    • Labels:
      None

      Description

      When running FlatMapGroupsInPandasExec or AggregateInPandasExec the shuffle uses a default number of partitions of 200 in "spark.sql.shuffle.partitions". If the data is small, e.g. in testing, many of the partitions will be empty but are treated just the same. For example, ArrowPythonRunner.compute is called and starts a number of threads that do nothing since there is no iteration. These computations could be skipped for empty partitions, which will save time overall.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                bryanc Bryan Cutler
                Reporter:
                bryanc Bryan Cutler
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: