Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
I'll take this one on.
Although we construct the individual NumPy arrays for pandas efficiently, even in the zero-copy case the pandas.DataFrame constructor performs an extra memory copy and consolidation step internally at the end.
This is particular to the pandas 0.x/1.x memory layout and will change with pandas 2.0, but wide use of that is still quite a ways off.
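As a minimal illustration (not part of the original report): handing pandas a dict of already-built NumPy arrays lets the constructor consolidate same-dtype columns into a new 2D block, so the result no longer shares memory with the inputs.

```python
import numpy as np
import pandas as pd

# Two columns that a naive Arrow -> pandas conversion might hand to pandas.
a = np.arange(3, dtype=np.float64)
b = np.arange(3, dtype=np.float64)

# The DataFrame constructor consolidates same-dtype columns into one new
# (n_columns, n_rows) block, copying the input arrays in the process.
df = pd.DataFrame({'a': a, 'b': b})

# False: the DataFrame's data is a fresh copy, not a view of `a`.
print(np.shares_memory(a, df['a'].to_numpy()))
```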
We can avoid this overhead for now by:
- computing the exact internal "block" structure of the DataFrame up front. Since we know the null counts of the Arrow data, we can determine in advance whether type casts to accommodate nulls are necessary (e.g. an int64 column with nulls must be promoted to float64)
- pre-allocating empty column-major blocks
- writing the column data directly into the block slices
- constructing the DataFrame from the blocks with zero copy (see the sketch after this list)
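A rough sketch of those steps follows. This is not the actual Arrow implementation; it leans on pandas' private, version-dependent BlockManager/make_block internals, and the column names, validity masks, and dtype-promotion rule are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from pandas.core.internals import BlockManager, make_block

# Hypothetical per-column input standing in for Arrow columns: values plus a
# validity mask (True = valid). Names and data are illustrative only.
columns = pd.Index(['ints', 'floats'])
index = pd.RangeIndex(3)
col_data = {
    'ints':   (np.array([1, 2, 3], dtype=np.int64), np.array([True, False, True])),
    'floats': (np.array([1.5, 2.5, 3.5]),           np.array([True, True, True])),
}

# Step 1: use the known null counts to pick block dtypes up front
# (an integer column with any nulls must be promoted to float64).
def target_dtype(values, null_count):
    if null_count > 0 and values.dtype.kind in 'iu':
        return np.dtype('float64')
    return values.dtype

# Step 2: pre-allocate one consolidated block; pandas stores a block as an
# (n_columns, n_rows) array, i.e. column-major from the DataFrame's view.
# Here both columns end up float64, so a single block suffices.
block_values = np.empty((len(columns), len(index)), dtype=np.float64)

# Step 3: write each column into its slice of the block, filling nulls with NaN.
for i, name in enumerate(columns):
    values, valid = col_data[name]
    null_count = int((~valid).sum())
    out = block_values[i, :]
    out[:] = values.astype(target_dtype(values, null_count), copy=False)
    out[~valid] = np.nan

# Step 4: wrap the filled block in a BlockManager and hand it to DataFrame,
# avoiding a second copy/consolidation pass.
block = make_block(block_values, placement=slice(0, len(columns)))
mgr = BlockManager((block,), [columns, index])
df = pd.DataFrame(mgr)
print(df)
```

A full implementation would group columns by resulting dtype and pre-allocate one block per dtype group; the single-block case above just keeps the sketch short.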
Attachments
Issue Links
- is related to: ARROW-428 [Python] Deserialize from Arrow record batches to pandas in parallel using a thread pool (Resolved)