Apache Arrow / ARROW-18156

[Python/C++] High memory usage/potential leak when reading parquet using Dataset API


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0.1
    • Fix Version/s: None
    • Component/s: Parquet
    • Labels: None

    Description

      Hi,

      I have a DataFrame that is 2.35 GB in memory (1.17 GB on disk), which I'm loading using the following snippet:

      import os

      import pyarrow
      import pyarrow.dataset as ds
      import pyarrow.parquet as pq  # used for the ParquetFile comparison below
      from importlib_metadata import version
      from psutil import Process

      def format_bytes(num_bytes: int) -> str:
          return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"

      def main():
          print(version("pyarrow"))
          print(pyarrow.default_memory_pool().backend_name)
          process = Process(os.getpid())
          runs = 10
          print(f"Runs: {runs}")
          for i in range(runs):
              dataset = ds.dataset("df.pq")
              table = dataset.to_table()
              df = table.to_pandas()
              print(
                  f"After run {i}: RSS = {format_bytes(process.memory_info().rss)}, "
                  f"PyArrow Allocated Bytes = {format_bytes(pyarrow.total_allocated_bytes())}"
              )

      if __name__ == "__main__":
          main()



      On PyArrow v4.0.1 the output is as follows:

      4.0.1
      system
      Runs: 10
      After run 0: RSS = 7.59 GB, PyArrow Allocated Bytes = 6.09 GB
      After run 1: RSS = 13.36 GB, PyArrow Allocated Bytes = 6.09 GB
      After run 2: RSS = 14.74 GB, PyArrow Allocated Bytes = 6.09 GB
      After run 3: RSS = 15.78 GB, PyArrow Allocated Bytes = 6.09 GB
      After run 4: RSS = 18.36 GB, PyArrow Allocated Bytes = 6.09 GB
      After run 5: RSS = 19.69 GB, PyArrow Allocated Bytes = 6.09 GB
      After run 6: RSS = 21.21 GB, PyArrow Allocated Bytes = 6.09 GB
      After run 7: RSS = 21.52 GB, PyArrow Allocated Bytes = 6.09 GB
      After run 8: RSS = 21.49 GB, PyArrow Allocated Bytes = 6.09 GB
      After run 9: RSS = 21.72 GB, PyArrow Allocated Bytes = 6.09 GB
      After run 10: RSS = 20.95 GB, PyArrow Allocated Bytes = 6.09 GB

      If I replace ds.dataset("df.pq").to_table() with pq.ParquetFile("df.pq").read(), the output is:

      4.0.1
      system
      Runs: 10
      After run 0: RSS = 2.38 GB, PyArrow Allocated Bytes = 1.34 GB
      After run 1: RSS = 2.49 GB, PyArrow Allocated Bytes = 1.34 GB
      After run 2: RSS = 2.50 GB, PyArrow Allocated Bytes = 1.34 GB
      After run 3: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
      After run 4: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
      After run 5: RSS = 2.56 GB, PyArrow Allocated Bytes = 1.34 GB
      After run 6: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
      After run 7: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
      After run 8: RSS = 2.48 GB, PyArrow Allocated Bytes = 1.34 GB
      After run 9: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
      After run 10: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB

      The usage profile of the older non-dataset API is much lower and tracks the size of the DataFrame much more closely. It also looks as if the Dataset example is leaking memory. I initially assumed the growth in RSS was just due to PyArrow's use of jemalloc, which can hold on to freed pages, but as the output above shows, I seem to be using the system allocator here.
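      One quick way to distinguish allocator retention from a true leak (not from the original report; a minimal sketch using only the documented memory-pool API) is to watch Arrow's own accounting, which is exact, while RSS also includes pages the allocator keeps cached after Arrow has freed them:

      ```python
      import pyarrow

      # Sketch: check whether RSS growth is allocator retention rather than a leak.
      pool = pyarrow.default_memory_pool()
      baseline = pool.bytes_allocated()

      table = pyarrow.table({"x": list(range(1_000_000))})  # ~8 MB of int64 values
      assert pool.bytes_allocated() > baseline

      del table  # drop the last reference; Arrow frees the buffers immediately
      assert pool.bytes_allocated() == baseline

      # Best-effort hint to hand cached-but-unused pages back to the OS; with the
      # system allocator this is typically a no-op, while with jemalloc/mimalloc
      # it can shrink RSS noticeably.
      pool.release_unused()
      ```

      If bytes_allocated returns to its baseline but RSS stays high, the memory is being retained by the allocator rather than leaked by Arrow.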

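      As a possible mitigation (not verified against this report), the Dataset API can also stream record batches instead of materializing the whole table, which bounds peak allocation to roughly one batch at a time. A minimal sketch on a small synthetic file:

      ```python
      import os
      import tempfile

      import pyarrow as pa
      import pyarrow.dataset as ds
      import pyarrow.parquet as pq

      # Sketch: stream batches from a dataset instead of calling to_table(),
      # so only one batch needs to be resident at a time.
      with tempfile.TemporaryDirectory() as tmp:
          path = os.path.join(tmp, "df.pq")
          pq.write_table(pa.table({"x": list(range(100_000))}), path)

          total_rows = 0
          for batch in ds.dataset(path).to_batches(batch_size=10_000):
              total_rows += batch.num_rows  # process, then let the batch be freed
          print(total_rows)  # 100000
      ```

      This trades the convenience of a single Table for a much flatter memory profile, which may also help narrow down whether the growth is tied to to_table() specifically.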

      People

        Assignee: Unassigned
        Reporter: Norbert (Norbo11)
        Votes: 0
        Watchers: 4