SPARK-36283 (sub-task of SPARK-34600: Support user defined types in Pandas UDF)

Bug when creating dataframe without schema and with Arrow disabled

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.1.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None

    Description

      A small repository that reproduces the issue is available at https://github.com/darcy-shen/spark-36283

      Case 1: Create a PySpark DataFrame from a pandas DataFrame with Arrow disabled and without a schema

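      # Assumption: ExamplePoint is the test class in pyspark.testing.sqlutils (the linked repo may define its own).
      import pandas as pd
      from pyspark.sql import SparkSession
      from pyspark.testing.sqlutils import ExamplePoint

      spark = SparkSession.builder.getOrCreate()
      spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
      pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
      df = spark.createDataFrame(pdf)
      df.show()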
      

      Incorrect result (no error is raised, but both rows show (0.0, 0.0) instead of the input coordinates)

      +----------+
      |     point|
      +----------+
      |(0.0, 0.0)|
      |(0.0, 0.0)|
      +----------+
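
Because Case 1 fails silently, a regression check has to compare values rather than wait for an exception. The following is a minimal sketch of such a comparison in plain Python. `find_mismatches` is a hypothetical helper, not part of PySpark, and the `shown` list simply transcribes the table above; how the actual values are read back from the DataFrame is left open here.

```python
def find_mismatches(expected, actual, tol=1e-9):
    """Return (row_index, expected_xy, actual_xy) for every row whose coordinates differ."""
    if len(expected) != len(actual):
        raise ValueError("row count differs: %d vs %d" % (len(expected), len(actual)))
    mismatches = []
    for i, (exp, act) in enumerate(zip(expected, actual)):
        # Compare coordinate by coordinate, treating ints and floats alike.
        if any(abs(float(e) - float(a)) > tol for e, a in zip(exp, act)):
            mismatches.append((i, exp, act))
    return mismatches


expected = [(1, 1), (2, 2)]        # coordinates passed to ExamplePoint in Case 1
shown = [(0.0, 0.0), (0.0, 0.0)]   # values transcribed from the df.show() output above
print(find_mismatches(expected, shown))
```

An empty list means the round trip preserved every point; for the Case 1 output, both rows are reported.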
      

      Case 2: Create a PySpark DataFrame from a pandas DataFrame with Arrow disabled and with a schema that does not match the data (int coordinates, DoubleType elements)

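      # Assumption: ExamplePoint and ExamplePointUDT are the test classes in pyspark.testing.sqlutils (the linked repo may define its own).
      import pandas as pd
      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructField, StructType
      from pyspark.testing.sqlutils import ExamplePoint, ExamplePointUDT

      spark = SparkSession.builder.getOrCreate()
      spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
      pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
      schema = StructType([StructField('point', ExamplePointUDT(), False)])
      df = spark.createDataFrame(pdf, schema)
      df.show()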
      

      Error raised as expected

      Traceback (most recent call last):
        File "bug2.py", line 54, in <module>
          df = spark.createDataFrame(pdf, schema)
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/session.py", line 673, in createDataFrame
          return super(SparkSession, self).createDataFrame(
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py", line 300, in createDataFrame
          return self._create_dataframe(data, schema, samplingRatio, verifySchema)
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/session.py", line 700, in _create_dataframe
          rdd, schema = self._createFromLocal(map(prepare, data), schema)
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/session.py", line 509, in _createFromLocal
          data = list(data)
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/session.py", line 682, in prepare
          verify_func(obj)
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1409, in verify
          verify_value(obj)
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1390, in verify_struct
          verifier(v)
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1409, in verify
          verify_value(obj)
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1304, in verify_udf
          verifier(dataType.toInternal(obj))
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1409, in verify
          verify_value(obj)
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1354, in verify_array
          element_verifier(i)
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1409, in verify
          verify_value(obj)
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1403, in verify_default
          verify_acceptable_types(obj)
        File "/Users/da/.pyenv/versions/spark-36283/lib/python3.8/site-packages/pyspark/sql/types.py", line 1291, in verify_acceptable_types
          raise TypeError(new_msg("%s can not accept object %r in type %s"
      TypeError: element in array field point: DoubleType can not accept object 1 in type <class 'int'>
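
The TypeError in Case 2 comes from the schema verifier rejecting a Python int where DoubleType is expected. The sketch below imitates that strictness in plain Python and shows one possible workaround: coercing the coordinates to float before the points are built. `Point`, `strict_double_check`, and `as_float_point` are hypothetical names used only for illustration. They are not part of PySpark, and `Point` merely stands in for `ExamplePoint` so that the snippet runs without Spark.

```python
class Point:
    """Hypothetical stand-in for ExamplePoint, so the sketch runs without Spark."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def strict_double_check(value):
    """Simplified imitation of the check in the traceback: only float is accepted."""
    if type(value) is not float:
        raise TypeError(
            "DoubleType can not accept object %r in type %s" % (value, type(value))
        )
    return value


def as_float_point(point):
    """Return a copy of the point whose coordinates are Python floats."""
    return Point(float(point.x), float(point.y))


raw = [Point(1, 1), Point(2, 2)]             # int coordinates, as in Case 2
fixed = [as_float_point(p) for p in raw]     # float coordinates, as in Case 3

for p in fixed:
    # Neither call raises, because every coordinate is now a float.
    strict_double_check(p.x)
    strict_double_check(p.y)
```

With the real classes, the same idea amounts to writing `ExamplePoint(float(x), float(y))`, which matches what Case 3 does with literal floats.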
      

      Case 3: Create a PySpark DataFrame from a pandas DataFrame with Arrow disabled and with a matching schema

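      # Assumption: ExamplePoint and ExamplePointUDT are the test classes in pyspark.testing.sqlutils (the linked repo may define its own).
      import pandas as pd
      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructField, StructType
      from pyspark.testing.sqlutils import ExamplePoint, ExamplePointUDT

      spark = SparkSession.builder.getOrCreate()
      spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
      pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1.0, 1.0), ExamplePoint(2.0, 2.0)])})
      schema = StructType([StructField('point', ExamplePointUDT(), False)])
      df = spark.createDataFrame(pdf, schema)
      df.show()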
      

      Correct result

      +----------+
      |     point|
      +----------+
      |(1.0, 1.0)|
      |(2.0, 2.0)|
      +----------+
      

          People

            Assignee: Unassigned
            Reporter: Darcy Shen (sadhen)