Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-22216 Improving PySpark/Pandas interoperability
  3. SPARK-23334

Fix pandas_udf with return type StringType() to handle str type properly in Python 2.

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.0
    • Component/s: PySpark, SQL
    • Labels:
      None
    • Target Version/s:

      Description

      In Python 2, when pandas_udf tries to return string type value created in the udf with "..", the execution fails. E.g.,

      from pyspark.sql.functions import pandas_udf, col
      import pandas as pd
      
      df = spark.range(10)
      str_f = pandas_udf(lambda x: pd.Series(["%s" % i for i in x]), "string")
      df.select(str_f(col('id'))).show()
      

      raises the following exception:

      ...
      
      java.lang.AssertionError: assertion failed: Invalid schema from pandas_udf: expected StringType, got BinaryType
      	at scala.Predef$.assert(Predef.scala:170)
      	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:93)
      
      ...
      

      Seems like pyarrow ignores type parameter for pa.Array.from_pandas() and consider it as binary type when the type is string type and the string values are str instead of unicode in Python 2.

        Attachments

          Activity

            People

            • Assignee:
              ueshin Takuya Ueshin
              Reporter:
              ueshin Takuya Ueshin
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: