Spark / SPARK-30239

Creating a DataFrame with Pandas rather than NumPy datatypes fails


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.4.3
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Environment: Databricks: 48.00 GB | 24 Cores | DBR 6.0 | Spark 2.4.3 | Scala 2.11

    Description

      It's possible to work with DataFrames in Pandas and shuffle them back over to Spark DataFrames for processing; however, using Pandas extension datatypes like Int64 (https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html) raises an error saying that long / float values can't be converted.

      Internally, this is because np.nan is a float, and pd.Int64Dtype() allows only integers, with np.nan as the single permitted float value.
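      For illustration, a minimal sketch of the failing case (the toy data and the name pdf are assumptions, not from the report; spark is an active SparkSession as in the report):

      import pandas as pd
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()

      # A pandas DataFrame whose column uses the nullable Int64 extension dtype.
      pdf = pd.DataFrame({"gridID": pd.array([1, 2, None], dtype="Int64")})

      # With the plain NumPy int64 dtype this conversion succeeds, but with the
      # Int64 extension dtype it raises the long / float conversion error above.
      sdf = spark.createDataFrame(pdf)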

       

      The current workaround is to keep the columns as floats and, after converting to a Spark DataFrame, recast the column as LongType(). For example:

       

      from pyspark.sql.types import LongType

      # Build the Spark DataFrame from the float-typed pandas DataFrame, then
      # recast the column back to a long.
      sdfC = spark.createDataFrame(kgridCLinked)
      sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))

       

      However, this is awkward and redundant.
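      For completeness, a sketch of the pandas-side half of that workaround, run before the createDataFrame call above (kgridCLinked and gridID come from the report; the explicit cast is an assumed illustration):

      # Keep the nullable Int64 column as plain float64 so Spark can ingest it;
      # missing values remain NaN, and the recast to LongType() happens afterwards.
      kgridCLinked["gridID"] = kgridCLinked["gridID"].astype("float64")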


          People

            Assignee: Unassigned
            Reporter: Philip Kahn (tigerhawkvok)
            Votes: 0
            Watchers: 2
