Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version/s: 2.4.3
- Fix Version/s: None
- Component/s: None
- Environment: Databricks: 48.00 GB | 24 Cores | DBR 6.0 | Spark 2.4.3 | Scala 2.11
Description
It's possible to convert a Spark DataFrame to pandas, work with it there, and shuffle it back to a Spark DataFrame for processing; however, columns that use pandas extension dtypes such as Int64 ( https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html ) cause the conversion to throw an error (that long / float can't be converted).
Internally this is because np.nan is a float: pd.Int64Dtype() otherwise allows only integers, with the single float value np.nan standing in for missing data.
The current workaround is to keep such columns as floats and, after conversion to the Spark DataFrame, to recast the column as LongType(). For example:
from pyspark.sql.types import LongType

sdfC = spark.createDataFrame(kgridCLinked)  # pandas DataFrame with the column held as float
sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))
However, this is awkward and redundant.
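The root cause described above can be reproduced in pandas alone, without Spark. The sketch below (hypothetical column values, not from the report) shows that a nullable Int64 column with missing data cannot be materialized as a plain int64 NumPy array, while routing it through float64, as in the workaround, succeeds with np.nan standing in for the missing value:

```python
import numpy as np
import pandas as pd

# A pandas nullable-integer column: the missing entry is pd.NA,
# which has no plain-int64 representation.
s = pd.Series([1, 2, None], dtype="Int64")

# Asking for a plain int64 ndarray fails because of the missing value.
try:
    s.to_numpy(dtype="int64")
except (ValueError, TypeError) as exc:
    print(type(exc).__name__)

# The workaround's first half: cast to float64, where the missing
# value becomes np.nan (a float), which createDataFrame can ingest.
as_float = s.astype("float64")
print(as_float.dtype)
print(np.isnan(as_float.iloc[2]))
```

After createDataFrame, the second half of the workaround (casting back to LongType()) restores the integer semantics on the Spark side.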