Spark / SPARK-30239

Creating a DataFrame with Pandas rather than NumPy datatypes fails


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Environment: Databricks: 48.00 GB | 24 Cores | DBR 6.0 | Spark 2.4.3 | Scala 2.11

    Description

      It's possible to work with DataFrames in Pandas and shuffle them back over to Spark DataFrames for processing; however, using Pandas extension datatypes like Int64 (https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html) raises an error saying that long / float values can't be converted.

      Internally, this is because np.nan is a float, and pd.Int64Dtype() allows only integers, with np.nan as the single permitted float value.
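      For illustration, a minimal sketch of the failing case (the toy data and the name pdf are assumptions, not from the report; spark is an active SparkSession as in the report):

      import pandas as pd
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # A pandas DataFrame whose column uses the nullable Int64 extension dtype.
      pdf = pd.DataFrame({"gridID": pd.array([1, 2, None], dtype="Int64")})

      # With the plain NumPy int64 dtype this conversion succeeds, but with the
      # Int64 extension dtype it raises the long / float conversion error above.
      sdf = spark.createDataFrame(pdf)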

       

      The current workaround is to keep the columns as floats and, after converting to a Spark DataFrame, recast the column as LongType(). For example:

       

      from pyspark.sql.types import LongType

      # Build the Spark DataFrame from the float-typed pandas DataFrame, then
      # recast the column back to a long.
      sdfC = spark.createDataFrame(kgridCLinked)
      sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))

       

      However, this is awkward and redundant.
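      For completeness, a sketch of the pandas-side half of that workaround, run before the createDataFrame call above (kgridCLinked and gridID come from the report; the explicit cast is an assumed illustration):

      # Keep the nullable Int64 column as plain float64 so Spark can ingest it;
      # missing values remain NaN, and the recast to LongType() happens afterwards.
      kgridCLinked["gridID"] = kgridCLinked["gridID"].astype("float64")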


          People

            Assignee: Unassigned
            Reporter: Philip Kahn (tigerhawkvok)
            Votes: 0
            Watchers: 2
