Spark / SPARK-28128

Pandas Grouped UDFs should skip over empty partitions


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.3
    • Fix Version/s: 3.0.0
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      When FlatMapGroupsInPandasExec or AggregateInPandasExec runs, the preceding shuffle uses the number of partitions set by "spark.sql.shuffle.partitions", which defaults to 200. If the data is small, e.g. in testing, many of those partitions will be empty, yet they are processed exactly like non-empty ones. For example, ArrowPythonRunner.compute is still called and starts a number of threads that do nothing, since there are no rows to iterate over. Skipping these computations for empty partitions would save time overall.
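The idea can be illustrated without Spark. The following is a minimal sketch in plain Python, not the actual patch: `hash_partition`, `skip_if_empty`, and `expensive_compute` are hypothetical names invented for this example, and `expensive_compute` merely stands in for the per-partition work of starting a Python runner. It shows that a small input spread over 200 partitions leaves most of them empty, and that peeking at the partition iterator before doing any setup avoids the wasted calls.

```python
from itertools import chain

# Default value of spark.sql.shuffle.partitions, as stated in the description.
NUM_SHUFFLE_PARTITIONS = 200


def hash_partition(rows, num_partitions):
    """Assign (key, value) rows to partitions by key hash (a stand-in for a shuffle)."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def skip_if_empty(compute):
    """Wrap a per-partition function so it is never invoked for an empty partition."""
    def wrapper(partition_iter):
        partition_iter = iter(partition_iter)
        try:
            first = next(partition_iter)
        except StopIteration:
            # Empty partition: return immediately, skipping the expensive setup.
            return iter(())
        # Put the peeked row back in front of the remaining rows.
        return compute(chain([first], partition_iter))
    return wrapper


calls = {"count": 0}


def expensive_compute(partition_iter):
    """Hypothetical stand-in for launching a Python worker and its threads."""
    calls["count"] += 1
    return iter([sum(value for _, value in partition_iter)])


# Small test-sized input: 3 groups with 4 rows each.
rows = [(key, value) for key in range(3) for value in range(4)]
partitions = hash_partition(rows, NUM_SHUFFLE_PARTITIONS)

guarded = skip_if_empty(expensive_compute)
results = [out for part in partitions for out in guarded(part)]
non_empty = sum(1 for part in partitions if part)
```

With three integer keys, only 3 of the 200 partitions hold data, so `expensive_compute` runs 3 times rather than 200. The wrapper peeks at a single row and chains it back, so the wrapped function still sees the full partition.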


          People

            Assignee: Bryan Cutler (bryanc)
            Reporter: Bryan Cutler (bryanc)
            Votes: 0
            Watchers: 2

