[SPARK-47466] Add PySpark DataFrame method to return iterator of PyArrow RecordBatches - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 4.0.0, 3.5.1
Fix Version/s: None
Component/s: Connect, Input/Output, PySpark, SQL
Labels: None

Language: Python

Description

As a follow-up to SPARK-47365:

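toArrow() is useful when the data is relatively small, because it returns the entire result as a single PyArrow Table. For larger data, a better way to return the contents of a PySpark DataFrame in Arrow format is to return an iterator of PyArrow RecordBatches, so that the caller can process one batch at a time.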

Issue Links

is related to: SPARK-48478 Allow passing iterator of PyArrow RecordBatches to createDataFrame() (Open)

relates to: SPARK-47365 Add toArrow() DataFrame method to PySpark (Resolved)

People

Assignee: Unassigned

Reporter: Ian Cook

Votes: 0

Watchers: 1

Dates

Created: 19/Mar/24 16:58

Updated: 02/Jun/24 13:15