Details
-
Sub-task
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
4.0.0, 3.5.1
Description
Over in the Apache Arrow community, we hear from a lot of users who want to return the contents of a PySpark DataFrame as a PyArrow Table. Currently the only documented way to do this is:
PySpark DataFrame --> pandas DataFrame --> PyArrow Table
This adds significant overhead compared to going direct from PySpark DataFrame to PyArrow Table. Since PySpark already goes through PyArrow to convert to pandas, would it be possible to publicly expose a toArrow() method of the Spark DataFrame class?
Attachments
Issue Links
- contains
-
SPARK-47465 Remove experimental tag from toArrow() PySpark DataFrame method
- Resolved
- is related to
-
SPARK-48220 Allow passing PyArrow Table to createDataFrame()
- Resolved
-
SPARK-47466 Add PySpark DataFrame method to return iterator of PyArrow RecordBatches
- Open
- links to