Apache Arrow / ARROW-7305

[Python] High memory usage writing pyarrow.Table with large strings to parquet


Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version: 0.15.1
    • Fix Version: None
    • Component: Python
    • Environment: Mac OSX

    Description

      My use case is somewhat specific: the datasets I store contain large strings (1-100 MB each).

      Let's take a single row as an example.

      43mb.csv is a 1-row CSV with 10 columns; one column contains a 43 MB string.

      When I read this CSV with pandas and then dump it to Parquet, my script consumes roughly 10x the 43 MB.

      As the number of such rows grows, the relative memory overhead diminishes, but I want to focus on this specific case.

      Here's the footprint after running the script under memory_profiler:

      Line #    Mem usage    Increment   Line Contents
      ================================================
           4     48.9 MiB     48.9 MiB   @profile
           5                             def test():
           6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
           7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
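
      For context, here is roughly the script behind that profile, reconstructed from the line contents shown above (a sketch; the @profile decorator comes from memory_profiler, and the filenames are the ones used in this description):

        # repro.py - sketch of the profiled script, reconstructed from the
        # memory_profiler output above.
        import pandas as pd
        from memory_profiler import profile

        @profile
        def test():
            data = pd.read_csv('43mb.csv')      # 1 row, 10 columns, one ~43 MB string
            data.to_parquet('out.parquet')      # engine='auto' picks pyarrow when it is installed

        if __name__ == '__main__':
            test()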
       

      Is this typical for Parquet in the case of big strings?
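
      For what it's worth, with the pyarrow engine DataFrame.to_parquet is essentially a pandas-to-Arrow conversion followed by a Parquet write, so the two steps can be profiled separately. A minimal sketch (filenames as above):

        import pandas as pd
        import pyarrow as pa
        import pyarrow.parquet as pq

        data = pd.read_csv('43mb.csv')
        table = pa.Table.from_pandas(data)      # pandas -> Arrow conversion
        pq.write_table(table, 'out.parquet')    # Arrow -> Parquet write

      Splitting it this way would show whether the extra memory is allocated during Table.from_pandas or during write_table.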

      Attachments

        1. 50mb.csv.gz (289 kB, Bogdan Klichuk)



      People

        Assignee: Unassigned
        Reporter: Bogdan Klichuk (klichukb)
        Votes: 0
        Watchers: 4
