When executing toPandas with Arrow enabled, partitions that arrive in the JVM out-of-order must be buffered before they can be send to Python. This causes an excess of memory to be used in the driver JVM and increases the time it takes to complete because data must sit in the JVM waiting for preceding partitions to come in.
This can be improved by sending out-of-order partitions to Python as soon as they arrive in the JVM, followed by a list of indices so that Python can assemble the data in the correct order. This way, data is not buffered at the JVM and there is no waiting on particular partitions so performance will be increased.
- is related to
SPARK-23030 Decrease memory consumption with toPandas() collection using Arrow
- links to