Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-28422

GROUPED_AGG pandas_udf doesn't with spark.sql() without group by clause

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0, 2.4.1, 2.4.2, 2.4.3
    • Fix Version/s: 3.0.0
    • Component/s: PySpark, SQL
    • Labels:
      None

      Description

       

      from pyspark.sql.functions import pandas_udf, PandasUDFType
      @pandas_udf('double', PandasUDFType.GROUPED_AGG)
      def max_udf(v):
          return v.max()
      
      df = spark.range(0, 100)
      spark.udf.register('max_udf', max_udf)
      df.createTempView('table')
      
      # A. This works
      df.agg(max_udf(df['id'])).show()
      
      # B. This doesn't work
      spark.sql("select max_udf(id) from table").show()

       

       

      Query plan:

      A:

      == Parsed Logical Plan ==
      
      'Aggregate [max_udf('id) AS max_udf(id)#140]
      
      +- Range (0, 1000, step=1, splits=Some(4))
      
      
      
      
      == Analyzed Logical Plan ==
      
      max_udf(id): double
      
      Aggregate [max_udf(id#64L) AS max_udf(id)#140]
      
      +- Range (0, 1000, step=1, splits=Some(4))
      
      
      
      
      == Optimized Logical Plan ==
      
      Aggregate [max_udf(id#64L) AS max_udf(id)#140]
      
      +- Range (0, 1000, step=1, splits=Some(4))
      
      
      
      
      == Physical Plan ==
      
      !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
      
      +- Exchange SinglePartition
      
         +- *(1) Range (0, 1000, step=1, splits=4)
      

      B:

      == Parsed Logical Plan ==
      
      'Project [unresolvedalias('max_udf('id), None)]
      
      +- 'UnresolvedRelation [table]
      
      
      
      
      == Analyzed Logical Plan ==
      
      max_udf(id): double
      
      Project [max_udf(id#0L) AS max_udf(id)#136]
      
      +- SubqueryAlias `table`
      
         +- Range (0, 100, step=1, splits=Some(4))
      
      
      
      
      == Optimized Logical Plan ==
      
      Project [max_udf(id#0L) AS max_udf(id)#136]
      
      +- Range (0, 100, step=1, splits=Some(4))
      
      
      
      
      == Physical Plan ==
      
      *(1) Project [max_udf(id#0L) AS max_udf(id)#136]
      
      +- *(1) Range (0, 100, step=1, splits=4)
      

       

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                viirya Liang-Chi Hsieh
                Reporter:
                icexelloss Li Jin
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: