[SPARK-13534] Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.1.0
Fix Version/s: 2.3.0
Component/s: PySpark
Labels: None

Description

The current code path for accessing Spark DataFrame data in Python using PySpark passes through an inefficient serialization-deserialization process that I've examined at a high level here: https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] objects are deserialized in pure Python as a list of tuples, which is then converted to a pandas.DataFrame using its from_records alternate constructor. This also uses a large amount of memory.
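To make the memory point concrete, here is a minimal standard-library sketch (not taken from the linked gist) comparing a row-wise list of tuples with one typed buffer per column. The sample data, the column names, and the helpers `rows_to_columns` and `row_wise_bytes` are assumptions made for illustration, and the byte counts are rough estimates based on `sys.getsizeof`.

```python
import sys
from array import array

# Hypothetical sample data: 1,000 rows of (id, score) as a row-wise list of tuples,
# similar in shape to what pure-Python deserialization of rows produces.
rows = [(i, i * 0.5) for i in range(1000)]


def rows_to_columns(rows):
    """Pivot row tuples into one typed, contiguous buffer per column."""
    ids, scores = zip(*rows)
    return array("q", ids), array("d", scores)


def row_wise_bytes(rows):
    """Approximate footprint: the list, each tuple, and each boxed value."""
    total = sys.getsizeof(rows)
    for row in rows:
        total += sys.getsizeof(row) + sum(sys.getsizeof(v) for v in row)
    return total


id_col, score_col = rows_to_columns(rows)
columnar_bytes = sys.getsizeof(id_col) + sys.getsizeof(score_col)

print("row-wise bytes (approx.):", row_wise_bytes(rows))
print("columnar bytes (approx.):", columnar_bytes)
```

The row-wise figure is dominated by per-object overhead (one tuple plus one boxed number per cell), which is the cost a columnar transfer format is meant to avoid.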

For flat (no nested types) schemas, the Apache Arrow memory layout (https://github.com/apache/arrow/tree/master/format) can be deserialized to pandas.DataFrame objects with little overhead beyond memcpy / system memory bandwidth. The main extra work is that Arrow's bitmasks must be examined and the corresponding null values replaced with pandas's sentinel values (None or NaN as appropriate).
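The bitmask step can be sketched in plain Python as follows. This is only an illustration of the idea, assuming a float64 column, little-endian values, and an Arrow-style validity bitmap (one bit per slot, least-significant bit first, 1 meaning non-null); `decode_float64_column` is a hypothetical helper, not an Arrow or pandas API, and a real implementation would do this in native code.

```python
import math
import struct


def decode_float64_column(validity, values, length):
    """Decode a nullable float64 column into a list, using NaN for nulls.

    validity: bytes, one bit per slot, least-significant bit first; 1 = valid.
    values:   bytes holding `length` little-endian float64 values.
    """
    raw = struct.unpack("<%dd" % length, values[: 8 * length])
    out = []
    for i, v in enumerate(raw):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        # Null slots may hold arbitrary bytes, so replace them with the sentinel.
        out.append(v if is_valid else math.nan)
    return out


# Five slots; slots 1 and 3 are null, so bits 0, 2 and 4 are set: 0b00010101.
validity = bytes([0b00010101])
values = struct.pack("<5d", 1.5, 0.0, 2.5, 0.0, 3.5)
column = decode_float64_column(validity, values, 5)
```

For column types without a natural NaN (strings, for example), the same loop would substitute None instead.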

I will be contributing patches to Arrow in the coming weeks for converting between Arrow and pandas in the general case, so if Spark can send Arrow memory to PySpark, we will hopefully be able to increase the Python data access throughput by an order of magnitude or more. I propose to add a new serializer for Spark DataFrame and a new method that can be invoked from PySpark to request an Arrow memory-layout byte stream, prefixed by a data header indicating array buffer offsets and sizes.
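One way to picture a byte stream prefixed by a header of buffer offsets and sizes is the sketch below. The framing is a hypothetical layout chosen for illustration (a uint32 buffer count followed by uint64 offset/size pairs); it is not the Arrow IPC format and not what Spark implements, and it omits the alignment padding a real format would likely need. `frame_buffers` and `read_buffers` are illustrative names.

```python
import struct


def frame_buffers(buffers):
    """Prefix concatenated buffers with a header of (offset, size) pairs.

    Hypothetical layout: uint32 buffer count, then one (uint64 offset,
    uint64 size) pair per buffer, then the buffer bytes back to back.
    Offsets are relative to the start of the body.
    """
    header = struct.pack("<I", len(buffers))
    offset = 0
    for buf in buffers:
        header += struct.pack("<QQ", offset, len(buf))
        offset += len(buf)
    return header + b"".join(buffers)


def read_buffers(stream):
    """Return zero-copy views of each buffer described by the header."""
    (count,) = struct.unpack_from("<I", stream, 0)
    body = memoryview(stream)[4 + 16 * count:]
    views = []
    for i in range(count):
        offset, size = struct.unpack_from("<QQ", stream, 4 + 16 * i)
        views.append(body[offset:offset + size])
    return views


# Example: a one-byte validity bitmap followed by two float64 values.
stream = frame_buffers([b"\x03", struct.pack("<2d", 1.5, 2.5)])
bitmap_view, values_view = read_buffers(stream)
```

Because the reader only slices a memoryview, the receiving side can locate every array buffer without copying or parsing individual values.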

Attachments


benchmark.py, 2 kB, 25/Jan/17 00:22, Bryan Cutler

Issue Links

blocks: SPARK-20791 Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame (Resolved)

is duplicated by: SPARK-13391 Use Apache Arrow as In-memory columnar store implementation (Resolved)

is related to: ARROW-288 Implement Arrow adapter for Spark Datasets (Resolved)

relates to:
- SPARK-19489 Stable serialization format for external & native code integration (Resolved)
- SPARK-14141 Let user specify datatypes of pandas dataframe in toPandas() (Resolved)

links to:
- [Github] Pull Request #15821 (BryanCutler)
- [Github] Pull Request #18459 (BryanCutler)

People

Assignee: Bryan Cutler

Reporter: Wes McKinney

Shepherd: Reynold Xin

Votes: 6

Watchers: 72

Dates

Created: 28/Feb/16 02:06

Updated: 07/Oct/17 14:28

Resolved: 23/Jun/17 01:01