Apache Arrow / ARROW-2709

[Python] write_to_dataset poor performance when splitting


    Description

      Hello,

      Posting this from GitHub (wesmckinn asked for it):

      https://github.com/apache/arrow/issues/2138

       

      import pandas as pd
      import numpy as np
      import pyarrow as pa
      import pyarrow.parquet as pq

      # ~85,000 rows of random data on a minute-frequency DatetimeIndex
      idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq='T')
      df = pd.DataFrame({'numeric_col': np.random.rand(len(idx)),
                         'string_col': pd.util.testing.rands_array(8, len(idx))},
                        index=idx)

      # Add a calendar-date column and write a dataset partitioned by it
      df["dt"] = df.index
      df["dt"] = df["dt"].dt.date
      table = pa.Table.from_pandas(df)
      pq.write_to_dataset(table, root_path='dataset_name', partition_cols=['dt'], flavor='spark')

       

      This works, but it is memory-inefficient: the Arrow table is a full copy of the large pandas DataFrame and quickly saturates the RAM.
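
      A minimal sketch of a lower-memory workaround, assuming the same df as above
      (the groupby loop and per-chunk calls below are only an illustration, not
      Arrow's internal implementation): convert and write one partition at a time,
      so only a single date's rows are ever duplicated as an Arrow table.

      for _, chunk in df.groupby('dt'):
          # Convert just this partition's rows; the rest of the DataFrame
          # stays in pandas memory only.
          chunk_table = pa.Table.from_pandas(chunk)
          # write_to_dataset still routes the rows into dt=<date>/ directories.
          pq.write_to_dataset(chunk_table, root_path='dataset_name',
                              partition_cols=['dt'], flavor='spark')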

       

      Thanks!

People

  Assignee: Unassigned
  Reporter: Olafsson Olaf
  Votes: 0
  Watchers: 7

Time Tracking

  Original Estimate: Not Specified
  Remaining Estimate: 0h
  Time Spent: 2h 50m