  1. Apache Arrow
  2. ARROW-9878

[Python] Table.to_pandas with self_destruct=True and split_blocks=True does not prevent memory usage from doubling


Details

    Description

      Tested on: pyarrow 1.0.1, Ubuntu 16.04, Python 3.7

       

      Code to reproduce:

      First, generate about 800 MB of data.

      import pyarrow as pa

      # generate about 800MB data
      data = [pa.array([10] * 1000)]
      batch = pa.record_batch(data, names=['f0'])
      with open('/tmp/t1.pa', 'wb') as f1:
          writer = pa.ipc.new_stream(f1, batch.schema)
          # write the same 1000-row batch 100000 times
          for i in range(100000):
              writer.write_batch(batch)
          writer.close()
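The "about 800 MB" figure can be checked with a little arithmetic. The sketch below assumes that `pa.array([10] * 1000)` is stored as 64-bit integers (8 bytes per value) and ignores IPC stream metadata and padding, so the real file will be slightly larger.

```python
# Rough payload size of the generated stream, assuming int64 storage.
BYTES_PER_INT64 = 8
values_per_batch = 1000
num_batches = 100000

payload_bytes = BYTES_PER_INT64 * values_per_batch * num_batches
payload_mb = payload_bytes / 1e6        # decimal megabytes
payload_mib = payload_bytes / 2 ** 20   # binary mebibytes

print(f'payload: {payload_bytes} bytes = {payload_mb:.0f} MB = {payload_mib:.1f} MiB')
```

If the conversion really doubles memory, the profile would be expected to peak at roughly twice this payload size.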
      

      Then test to_pandas with self_destruct=True, split_blocks=True and use_threads=False, recording the process memory with psrecord:

      import os
      import sys
      import time

      import pyarrow as pa

      # Pause so that psrecord can be attached to this process before any data is loaded.
      pid = os.getpid()
      print(f'run `psrecord {pid} --plot /tmp/t001.png` and then press ENTER.')
      sys.stdin.readline()

      with open('/tmp/t1.pa', 'rb') as f1:
          reader = pa.ipc.open_stream(f1)
          batches = [b for b in reader]

      pa_table = pa.Table.from_batches(batches)
      del batches
      time.sleep(3)  # leave a flat stretch in the plot before the conversion
      pdf = pa_table.to_pandas(self_destruct=True, split_blocks=True, use_threads=False)
      del pa_table  # drop the last reference; the table is unusable after self_destruct=True
      time.sleep(3)
      

      The attached file, t001.png, is the psrecord profiling result.
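As a cross-check that does not depend on psrecord, peak memory can also be read from inside the process. The following is a minimal sketch using only the standard library; `peak_rss_mb` and `report_peak` are hypothetical helper names introduced here, and the `bytearray` allocation is only a stand-in for the `to_pandas` call so that the snippet runs without pyarrow.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == 'darwin' else 1024
    return peak / divisor


def report_peak(label, step):
    """Run step() and print the peak RSS before and after it."""
    before = peak_rss_mb()
    result = step()
    after = peak_rss_mb()
    print(f'{label}: peak RSS {before:.0f} MB -> {after:.0f} MB')
    return result, after - before


# Stand-in workload (an assumption for illustration, not the original test).
buf, growth = report_peak('allocate 50 MB', lambda: bytearray(50 * 1024 * 1024))
```

Because `ru_maxrss` is a high-water mark, it only shows how far the peak rose during a step, not the memory released afterwards; a time series such as the psrecord plot is still needed to see the full shape.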

      Attachments

        1. t001.png
          27 kB
          Weichen Xu

            People

              lidavidm David Li
              weichenxu123 Weichen Xu

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m