Apache Arrow / ARROW-13187

[c++][python] Possibly memory not deallocated when reading in CSV

Details

    Description

      When reading a table from a CSV file with pyarrow 4.0.1, the memory held by the resulting table does not appear to be freed (or is not freed quickly enough) once the table variable goes out of scope. I'm unsure whether this is caused by pyarrow itself or by the way pyarrow's memory allocation interacts with Python's memory allocation. I encountered it when processing many large CSVs sequentially.

      When I run the following code, the process's memory usage increases rapidly until it runs out of memory.

      import pyarrow as pa
      import pyarrow.csv
      
      # Generate some CSV file to read in
      print("Generating CSV")
      with open("example.csv", "w+") as f_out:
          for i in range(0, 10000000):
              f_out.write("123456789,abc def ghi jkl\n")
      
      
      def read_in_the_csv():
          table = pa.csv.read_csv("example.csv")
          print(table)  # Not strictly necessary to replicate bug, table can also be an unused variable
          # This will free up the memory, as a workaround:
          # table = table.slice(0, 0)
      
      
      # Read in the CSV many times
      print("Reading in a CSV many times")
      for j in range(100000):
          read_in_the_csv()
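
To help narrow down whether the memory is still held by Arrow or only not yet returned to the operating system, one option is to sample Arrow's own allocation counter after each read. The following is a minimal sketch, not part of the original report: `track_allocated_bytes` and `check_pyarrow_csv` are hypothetical helper names, and it assumes that `pa.total_allocated_bytes()` reflects the bytes currently allocated from pyarrow's default memory pool.

```python
import gc
from typing import Callable, List


def track_allocated_bytes(
    read_once: Callable[[], None],
    allocated_bytes: Callable[[], int],
    iterations: int = 5,
) -> List[int]:
    """Call read_once() repeatedly and sample allocator usage after each call."""
    samples = []
    for _ in range(iterations):
        read_once()
        # Force a collection so that reference cycles, which only delay
        # deallocation, are not mistaken for a leak.
        gc.collect()
        samples.append(allocated_bytes())
    return samples


def check_pyarrow_csv(path: str = "example.csv") -> List[int]:
    """Hypothetical helper: sample Arrow's default pool while re-reading a CSV."""
    # Imported lazily so the generic helper above works without pyarrow.
    import pyarrow as pa
    import pyarrow.csv

    def read_once() -> None:
        table = pa.csv.read_csv(path)
        del table  # drop the only reference held by this function

    # Assumption: total_allocated_bytes() reports what the default pool holds now.
    return track_allocated_bytes(read_once, pa.total_allocated_bytes)
```

If the samples returned by `check_pyarrow_csv()` keep growing, something is presumably still referencing the table buffers. If they stay flat while the process's resident memory grows, the underlying allocator may simply not be returning freed memory to the operating system.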
      

      Attachments

        1. forward-refs.png
          241 kB
          Weston Pace
        2. backward-refs.png
          107 kB
          Weston Pace

            People

              apitrou Antoine Pitrou
              snkas Simon

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 50m