[SPARK-20791] Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.1.1
Fix Version/s: 2.3.0
Component/s: PySpark, SQL
Labels: None

Description

The current code for creating a Spark DataFrame from a Pandas DataFrame uses `to_records` to convert the DataFrame to a list of records and then converts each record to a list. The data is then serialized and transferred to the JVM through a number of separate calls. This process is very inefficient and also discards all schema metadata, so another pass over the data is required to infer types.
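
As a rough illustration of the record-based conversion described above, the sketch below mimics that step with plain pandas and NumPy. The sample DataFrame and its column names are assumptions made for this example; the serialization and transfer to the JVM are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
pdf = pd.DataFrame({
    "id": np.array([1, 2, 3], dtype="int64"),
    "score": np.array([0.5, np.nan, 2.0], dtype="float64"),
})

# Convert the DataFrame to a NumPy record array, then turn every
# record into plain Python values, one row at a time.
records = [r.tolist() for r in pdf.to_records(index=False)]

# Each row is now a tuple of Python scalars; the pandas dtypes are
# no longer attached to the data.
value_types = [type(v).__name__ for v in records[0]]
print(records[0], value_types)
```

After this step the original dtypes (`int64`, `float64`) are no longer carried with the rows, which is why the types have to be inferred again from the values.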

Using Apache Arrow, the Pandas DataFrame could be converted efficiently to Arrow data and transferred directly to the JVM to create the Spark DataFrame. Performance would improve, and the Pandas schema would be carried over so that the correct types are used.

Issues with poor type inference have come up before, causing confusion and frustration among users because it is not clear why the conversion fails or does not use the same type as Pandas. Fixing this with Apache Arrow would solve another pain point for Python users, and the following JIRAs could be closed:

Issue Links

is blocked by

SPARK-13534 Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

Resolved

SPARK-21583 Create a ColumnarBatch with ArrowColumnVectors for row based iteration

Resolved

is related to

SPARK-17804 Pandas dtypes are not correctly inferred by pyspark

Resolved

SPARK-18178 Importing Pandas Tables with Missing Values

Resolved

SPARK-11758 Missing Index column while creating a DataFrame from Pandas

Resolved

relates to

SPARK-22417 createDataFrame from a pandas.DataFrame reads datetime64 values as longs

Resolved

links to

[Github] Pull Request #19459 (BryanCutler)

[Github] Pull Request #19738 (BryanCutler)

People

Assignee: Bryan Cutler

Reporter: Bryan Cutler

Votes: 0

Watchers: 8

Dates

Created: 18/May/17 00:15

Updated: 12/Dec/22 18:11

Resolved: 13/Nov/17 04:16