Currently with Arrow enabled, calling toPandas() results in a collection of all partitions in the JVM in the form of batches of Arrow file format. Once collected in the JVM, they are served to the Python driver process.
I believe using the Arrow stream format can help to optimize this and reduce memory consumption in the JVM by only loading one record batch at a time before sending it to Python. This might also reduce the latency between making the initial call in Python and receiving the first batch of records.
- is blocked by
SPARK-23874 Upgrade apache/arrow to 0.10.0
- relates to
SPARK-25274 Improve toPandas with Arrow by sending out-of-order record batches
- links to