  1. Spark
  2. SPARK-34600 Support user defined types in Pandas UDF
  3. SPARK-34771

Support UDT for Pandas/Spark conversion with Arrow support enabled


Details

    • Type: Sub-task
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.2, 3.1.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

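      # Enable Arrow-based conversion (this key is deprecated since Spark 3.0
      # in favor of 'spark.sql.execution.arrow.pyspark.enabled').
      spark.conf.set("spark.sql.execution.arrow.enabled", "true")
      from pyspark.testing.sqlutils import ExamplePoint
      import pandas as pd
      # Build a pandas DataFrame whose only column holds UDT instances.
      pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
      df = spark.createDataFrame(pdf)  # pandas -> Spark
      df.toPandas()                    # Spark -> pandas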
      

      With `spark.sql.execution.arrow.enabled` set to false, the above snippet runs without any warnings.

      With `spark.sql.execution.arrow.enabled` set to true, the snippet still runs, but both `createDataFrame` and `toPandas` emit warnings. Arrow cannot convert the UDT column ("Unsupported type in conversion to Arrow"), so Spark falls back to the non-Arrow path and the Arrow optimization is effectively turned off.
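
To illustrate why a UDT column trips up the Arrow path, and what supporting it might involve, here is a minimal plain-Python sketch with no Spark or Arrow dependency. A columnar format cannot infer a type for an arbitrary Python object, but it can type the primitive values the object serializes to. `Point`, `to_storage`, and `from_storage` are hypothetical names introduced for illustration (they are not PySpark APIs), and the list-of-two-floats storage layout is an assumption, not something stated in this issue.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Point:
    """Hypothetical stand-in for a user-defined type such as ExamplePoint."""
    x: float
    y: float


def to_storage(points: List[Point]) -> List[List[float]]:
    """Serialize each point into a plain list of floats (assumed storage layout)."""
    return [[float(p.x), float(p.y)] for p in points]


def from_storage(rows: List[List[float]]) -> List[Point]:
    """Rebuild the user-facing objects from their storage representation."""
    return [Point(x, y) for x, y in rows]


points = [Point(1, 1), Point(2, 2)]
storage = to_storage(points)      # primitive values that a columnar format can type
restored = from_storage(storage)  # round trip back to the user-facing objects
```

Under these assumptions, the conversion would hand `storage` to the columnar layer and apply `from_storage` on the way back, so the user only ever sees `Point` objects.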

      Detailed steps to reproduce:

      $ bin/pyspark
      Python 3.8.8 (default, Feb 24 2021, 13:46:16)
      [Clang 10.0.0 ] :: Anaconda, Inc. on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
      Setting default log level to "WARN".
      To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
      21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
            /_/
      
      Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
      Spark context Web UI available at http://172.30.0.226:4040
      Spark context available as 'sc' (master = local[*], app id = local-1615994008526).
      SparkSession available as 'spark'.
      >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
      21/03/17 23:13:31 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
      >>> from pyspark.testing.sqlutils  import ExamplePoint
      >>> import pandas as pd
      >>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
      >>> df = spark.createDataFrame(pdf)
      /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
        Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type
      Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
        warnings.warn(msg)
      >>>
      >>> df.show()
      +----------+
      |     point|
      +----------+
      |(0.0, 0.0)|
      |(0.0, 0.0)|
      +----------+
      
      >>> df.schema
      StructType(List(StructField(point,ExamplePointUDT,true)))
      >>> df.toPandas()
      /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
        Unsupported type in conversion to Arrow: ExamplePointUDT
      Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
        warnings.warn(msg)
             point
      0  (0.0,0.0)
      1  (0.0,0.0)
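
The fallback in the transcript above is only visible as a `UserWarning`, so it is easy to miss in scripts. The following is a small sketch of one way to detect it programmatically using only the standard `warnings` module. `run_and_detect_fallback` and `_fake_to_pandas` are hypothetical helpers introduced here, and the marker string is taken from the warning text shown in the transcript; it may differ in other Spark versions.

```python
import warnings
from typing import Any, Callable, Tuple

# Substring shared by both fallback warnings in the transcript above.
FALLBACK_MARKER = "attempted Arrow optimization"


def run_and_detect_fallback(action: Callable[[], Any]) -> Tuple[Any, bool]:
    """Run `action` and report whether an Arrow fallback warning was emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is suppressed
        result = action()
    fell_back = any(
        issubclass(w.category, UserWarning) and FALLBACK_MARKER in str(w.message)
        for w in caught
    )
    return result, fell_back


def _fake_to_pandas() -> str:
    """Stand-in for df.toPandas() so the sketch runs without Spark."""
    warnings.warn("toPandas attempted Arrow optimization; however, failed", UserWarning)
    return "non-Arrow result"


demo_result, demo_fell_back = run_and_detect_fallback(_fake_to_pandas)
```

In a live session, the same helper could wrap the real call, for example `run_and_detect_fallback(df.toPandas)`, which for the UDT column in this report would be expected to report a fallback.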
      
      

          People

            Assignee: Unassigned
            Reporter: Darcy Shen (sadhen)