  1. Spark
  2. SPARK-22216 Improving PySpark/Pandas interoperability
  3. SPARK-22274

User-defined aggregation functions with pandas udf


    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0
    • Fix Version/s: 2.4.0
    • Component/s: PySpark
    • Labels:
      None

      Description

      The user-defined aggregation function proposed here does not implement partial aggregation, so all of a group's data is shuffled. A UDAF that supports partial aggregation is not covered by this Jira.
      Example:

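      from pyspark.sql.functions import pandas_udf, PandasUDFType
      from pyspark.sql.types import DoubleType

      # GROUPED_AGG is the aggregate UDF type as shipped in Spark 2.4.0
      @pandas_udf(DoubleType(), PandasUDFType.GROUPED_AGG)
      def mean(v):
          # v is a pandas.Series holding one group's values for the column
          return v.mean()

      df.groupby('id').agg(mean(df.v1), mean(df.v2))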
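
      To make the note about partial aggregation concrete, here is a minimal sketch in plain Python. It is not Spark code, and the names full_group_mean, partial_mean, rows and partitions are hypothetical. The first function mirrors what this issue describes: every value of a group is gathered in one place before the aggregate runs. The second shows what partial aggregation would look like for a mean: each partition emits only a small (sum, count) pair per key, and those pairs are merged.

```python
from collections import defaultdict

# Hypothetical (key, value) rows standing in for a grouped column.
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("a", 5.0), ("b", 20.0)]


def full_group_mean(rows):
    """Gather every value of a group in one place, then aggregate once.

    This is the shape of the approach in this issue: the aggregate
    function sees the complete group, so all rows must be moved.
    """
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def partial_mean(partitions):
    """Aggregate per partition first, then merge the small partial states.

    Only one (sum, count) pair per key and partition would need to be
    exchanged, instead of every row.
    """
    merged = defaultdict(lambda: [0.0, 0])
    for part in partitions:
        local = defaultdict(lambda: [0.0, 0])
        for key, value in part:
            local[key][0] += value
            local[key][1] += 1
        for key, (total, count) in local.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}


# Split the rows into two pretend partitions.
partitions = [rows[:2], rows[2:]]
full = full_group_mean(rows)
partial = partial_mean(partitions)
```

      Both functions give the same means. The difference is how much data has to be brought together first. A function like v.mean() written against a whole pandas.Series has no separate merge step, which is why a pandas UDF of this form cannot be partially aggregated as is.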
      
