Apache Arrow / ARROW-10052

[Python] Incrementally using ParquetWriter keeps data in memory (eventually running out of RAM for large datasets)


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 1.0.1
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      This ticket refers to the discussion between me and emkornfield on the mailing list: "Incrementally using ParquetWriter without keeping entire dataset in memory (large than memory parquet files)" (not yet available in the mail archives).

      Original post:

      Hi,
      I'm trying to write a large Parquet file to disk (larger than memory) using PyArrow's ParquetWriter and write_table, but even though the file is written incrementally to disk it still appears to keep the entire dataset in memory (eventually getting OOM killed). Basically, what I am trying to do is:
      import pyarrow as pa
      import pyarrow.parquet as pq

      with pq.ParquetWriter(
              output_file,
              arrow_schema,
              compression='snappy',
              allow_truncated_timestamps=True,
              version='2.0',  # Highest available Parquet format version
              data_page_version='2.0',  # Highest available data page version
      ) as writer:
          for rows_dataframe in function_that_yields_data():
              writer.write_table(
                  pa.Table.from_pydict(rows_dataframe, arrow_schema)
              )
      Where I have a function that yields data, and I then write it in chunks using write_table.
      Is it possible to force the ParquetWriter not to keep the entire dataset in memory, or is it simply not possible, for good reasons?
      I'm streaming data from a database and writing it to Parquet. The end consumer has plenty of RAM, but the machine that does the conversion doesn't.
      Regards,
      Niklas
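
      For reference, a self-contained sketch of the pattern described in the post. The generator body, schema, and output path below are placeholders added for illustration; they are not from the original report or the gist linked further down:

      import pyarrow as pa
      import pyarrow.parquet as pq

      # Placeholder stand-in for the database query: yields column-oriented
      # dicts matching the schema below.
      def function_that_yields_data(num_chunks=100, chunk_rows=100_000):
          for i in range(num_chunks):
              start = i * chunk_rows
              yield {
                  "id": list(range(start, start + chunk_rows)),
                  "value": [float(x % 1000) for x in range(chunk_rows)],
              }

      arrow_schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])
      output_file = "large_output.parquet"  # placeholder path

      with pq.ParquetWriter(output_file, arrow_schema, compression="snappy") as writer:
          for chunk in function_that_yields_data():
              # Each write_table call writes the chunk to disk as one or more
              # row groups; user code keeps no reference to the chunk afterwards.
              writer.write_table(pa.Table.from_pydict(chunk, schema=arrow_schema))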

      Minimal example (I can't attach it as a file for some reason): https://gist.github.com/bivald/2ddbc853ce8da9a9a064d8b56a93fc95

      Looking at it now that I've made a minimal example, I see something I didn't see/realize before: while the memory usage is increasing, it doesn't appear to be linear in the size of the file written. This indicates (I guess) that it isn't actually retaining the written dataset, but something else.
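
      One way to narrow down what is actually growing (a suggestion added here, not something from the original discussion) is to log Arrow's memory-pool usage next to the process peak RSS while writing. If pa.total_allocated_bytes() stays roughly flat while RSS keeps climbing, the growth is likely not retained table data but something outside the Arrow pool, for example allocator caching or accumulated writer metadata:

      import resource

      import pyarrow as pa

      def log_memory(label):
          # total_allocated_bytes() reports bytes held by Arrow's default
          # memory pool; ru_maxrss is the process peak RSS (kilobytes on Linux).
          arrow_bytes = pa.total_allocated_bytes()
          peak_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
          print(f"{label}: arrow_pool={arrow_bytes} B, peak_rss={peak_rss_kb} kB")

      # Hypothetical placement inside the write loop sketched above:
      #     for i, chunk in enumerate(function_that_yields_data()):
      #         writer.write_table(pa.Table.from_pydict(chunk, schema=arrow_schema))
      #         log_memory(f"after chunk {i}")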

          People

            Assignee: Unassigned
            Reporter: Niklas B (bivald)
            Votes: 0
            Watchers: 6
