Apache Arrow / ARROW-7305

[Python] High memory usage writing pyarrow.Table with large strings to parquet


Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version: 0.15.1
    • Fix Version: None
    • Component: Python
    • Environment: Mac OSX

    Description

      My use case is somewhat specific: the datasets I store contain large strings (1-100 MB each).

      Let's take a single row as an example.

      43mb.csv is a 1-row CSV with 10 columns; one column contains a 43 MB string.

      When I read this CSV with pandas and then dump it to Parquet, my script consumes roughly 10x the 43 MB.

      As the number of such rows grows, the relative memory overhead diminishes, but I want to focus on this specific case.

      Here's the footprint after running the script under memory_profiler:

      Line #    Mem usage    Increment   Line Contents
      ================================================
           4     48.9 MiB     48.9 MiB   @profile
           5                             def test():
           6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
           7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
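
      For context, here is roughly the script behind that profile, reconstructed from the line contents shown above (a sketch; the @profile decorator comes from memory_profiler, and the filenames are the ones used in this description):

        # repro.py - sketch of the profiled script, reconstructed from the
        # memory_profiler output above.
        import pandas as pd
        from memory_profiler import profile

        @profile
        def test():
            data = pd.read_csv('43mb.csv')      # 1 row, 10 columns, one ~43 MB string
            data.to_parquet('out.parquet')      # engine='auto' picks pyarrow when it is installed

        if __name__ == '__main__':
            test()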
       

      Is this typical for Parquet in the case of big strings?
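
      For what it's worth, with the pyarrow engine DataFrame.to_parquet is essentially a pandas-to-Arrow conversion followed by a Parquet write, so the two steps can be profiled separately. A minimal sketch (filenames as above):

        import pandas as pd
        import pyarrow as pa
        import pyarrow.parquet as pq

        data = pd.read_csv('43mb.csv')
        table = pa.Table.from_pandas(data)      # pandas -> Arrow conversion
        pq.write_table(table, 'out.parquet')    # Arrow -> Parquet write

      Splitting it this way would show whether the extra memory is allocated during Table.from_pandas or during write_table.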

      Attachments

        1. 50mb.csv.gz (289 kB, Bogdan Klichuk)



      People

        Assignee: Unassigned
        Reporter: Bogdan Klichuk (klichukb)
        Votes: 0
        Watchers: 4
