Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-22216 Improving PySpark/Pandas interoperability
  3. SPARK-24624

Can not mix vectorized and non-vectorized UDFs

    XMLWordPrintableJSON

Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.3.1
    • 2.4.0
    • SQL
    • None

    Description

      In the current impl, we have the limitation: users are unable to mix vectorized and non-vectorized UDFs in same Project. This becomes worse since our optimizer could combine continuous Projects into a single one. For example, 

      
      applied_df = df.withColumn('regular', my_regular_udf('total', 'qty')).withColumn('pandas', my_pandas_udf('total', 'qty'))
      
      

      Returns the following error. 

      
      IllegalArgumentException: Can not mix vectorized and non-vectorized UDFs
      
      java.lang.IllegalArgumentException: Can not mix vectorized and non-vectorized UDFs
       at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:170)
       at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$6.apply(ExtractPythonUDFs.scala:146)
       at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
       at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
       at scala.collection.immutable.List.foreach(List.scala:381)
       at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
       at scala.collection.immutable.List.map(List.scala:285)
       at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:146)
       at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:118)
       at org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114)
       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$6.apply(TreeNode.scala:312)
       at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
       at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
       at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
       at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
       at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:309)
       at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$8.apply(TreeNode.scala:331)
       at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:208)
       at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:329)
       at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
       at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:114)
       at org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:94)
       at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
       at org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:113)
       at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
       at scala.collection.immutable.List.foldLeft(List.scala:84)
       at org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:113)
       at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:100)
       at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:99)
       at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3312)
       at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:2750)
       ...
      
      

      Attachments

        Activity

          People

            icexelloss Li Jin
            smilegator Xiao Li
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: