Apache Arrow / ARROW-10052

[Python] Incrementally using ParquetWriter keeps data in memory (eventually running out of RAM for large datasets)


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 1.0.1
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      This ticket refers to the discussion between me and emkornfield on the mailing list: "Incrementally using ParquetWriter without keeping entire dataset in memory (large than memory parquet files)" (not yet available in the mail archives).

      Original post:

      Hi,
      I'm trying to write a large Parquet file to disk (larger than memory) using PyArrow's ParquetWriter and write_table, but even though the file is written incrementally to disk it still appears to keep the entire dataset in memory (eventually getting OOM killed). Basically, what I am trying to do is:
      import pyarrow as pa
      import pyarrow.parquet as pq

      with pq.ParquetWriter(
              output_file,
              arrow_schema,
              compression='snappy',
              allow_truncated_timestamps=True,
              version='2.0',  # Highest available Parquet format version
              data_page_version='2.0',  # Highest available data page version
      ) as writer:
          for rows_dataframe in function_that_yields_data():
              writer.write_table(
                  pa.Table.from_pydict(rows_dataframe, arrow_schema)
              )
      Where I have a function that yields data, and I then write it in chunks using write_table.
      Is it possible to force the ParquetWriter not to keep the entire dataset in memory, or is it simply not possible, for good reasons?
      I'm streaming data from a database and writing it to Parquet. The end consumer has plenty of RAM, but the machine that does the conversion doesn't.
      Regards,
      Niklas
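
      For reference, a self-contained sketch of the pattern described in the post. The generator body, schema, and output path below are placeholders added for illustration; they are not from the original report or the gist linked further down:

      import pyarrow as pa
      import pyarrow.parquet as pq

      # Placeholder stand-in for the database query: yields column-oriented
      # dicts matching the schema below.
      def function_that_yields_data(num_chunks=100, chunk_rows=100_000):
          for i in range(num_chunks):
              start = i * chunk_rows
              yield {
                  "id": list(range(start, start + chunk_rows)),
                  "value": [float(x % 1000) for x in range(chunk_rows)],
              }

      arrow_schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])
      output_file = "large_output.parquet"  # placeholder path

      with pq.ParquetWriter(output_file, arrow_schema, compression="snappy") as writer:
          for chunk in function_that_yields_data():
              # Each write_table call writes the chunk to disk as one or more
              # row groups; user code keeps no reference to the chunk afterwards.
              writer.write_table(pa.Table.from_pydict(chunk, schema=arrow_schema))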

      Minimal example (I can't attach it as a file for some reason): https://gist.github.com/bivald/2ddbc853ce8da9a9a064d8b56a93fc95

      Looking at it now that I've made a minimal example, I see something I didn't see/realize before: while the memory usage is increasing, it doesn't appear to be linear in the size of the file written. This indicates (I guess) that it isn't actually retaining the written dataset, but something else.
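
      One way to narrow down what is actually growing (a suggestion added here, not something from the original discussion) is to log Arrow's memory-pool usage next to the process peak RSS while writing. If pa.total_allocated_bytes() stays roughly flat while RSS keeps climbing, the growth is likely not retained table data but something outside the Arrow pool, for example allocator caching or accumulated writer metadata:

      import resource

      import pyarrow as pa

      def log_memory(label):
          # total_allocated_bytes() reports bytes held by Arrow's default
          # memory pool; ru_maxrss is the process peak RSS (kilobytes on Linux).
          arrow_bytes = pa.total_allocated_bytes()
          peak_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
          print(f"{label}: arrow_pool={arrow_bytes} B, peak_rss={peak_rss_kb} kB")

      # Hypothetical placement inside the write loop sketched above:
      #     for i, chunk in enumerate(function_that_yields_data()):
      #         writer.write_table(pa.Table.from_pydict(chunk, schema=arrow_schema))
      #         log_memory(f"after chunk {i}")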

          People

            Assignee: Unassigned
            Reporter: Niklas B (bivald)
            Votes: 0
            Watchers: 6
