Apache Arrow / ARROW-8801

[Python] Memory leak on read from parquet file with UTC timestamps using pandas

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.16.0, 0.17.0
    • Fix Version/s: 1.0.0
    • Component/s: Python
    • Environment: Tested using pyarrow 0.17.0, pandas 1.0.3, Python 3.7.5 on macOS Mojave. Also tested using pyarrow 0.16.0, pandas 1.0.3, Python 3.8.2 on Ubuntu 20.04 (Linux).

    Description

      Given this dump.py script:

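      import numpy as np
      import pandas as pd

      # 2**20 random UTC (timezone-aware) timestamps; removing utc=True makes the leak disappear
      x = pd.to_datetime(np.random.randint(0, 2**32, size=2**20), unit='ms', utc=True)
      # Write an uncompressed parquet file with the pyarrow engine
      pd.DataFrame({'x': x}).to_parquet('data.parquet', engine='pyarrow', compression=None)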
      

      and this load.py script:

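      import sys
      import pandas as pd

      def foo(engine):
          # Read the same file 2**9 times, discarding each result
          for _ in range(2**9):
              pd.read_parquet('data.parquet', engine=engine)
          print('Done')
          # Block here so the process's memory usage can be inspected
          input()

      # The engine name ('pyarrow' or 'fastparquet') comes from the command line
      foo(sys.argv[1])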
      

      After running "python dump.py" and then "python load.py pyarrow" on my machine, the Python process's memory usage stays above 4 GB while it waits for input. With "python load.py fastparquet" instead, it is about 100 MB, so this appears to be a pyarrow issue rather than a pandas issue. The leak disappears if "utc=True" is removed from dump.py, in which case the timestamps are timezone-naive.
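The memory figures above were read off while the process waited for input. As a rough way to get a comparable number from inside the script, here is a minimal sketch using only the standard library. The helper names peak_rss_mb and measure_growth are hypothetical and not part of pandas or pyarrow, and the resource module is available on Unix only (which covers the macOS and Linux setups mentioned in this report).

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure_growth(read_once, iterations=2**9):
    """Call read_once() repeatedly; return the growth in peak RSS (MiB).

    Peak RSS never decreases, so a loop that frees its results should
    show growth of roughly one result's size, while a leaking loop
    should show growth that scales with the number of iterations.
    """
    before = peak_rss_mb()
    for _ in range(iterations):
        read_once()
    return peak_rss_mb() - before
```

Assuming pandas is imported as pd, something like measure_growth(lambda: pd.read_parquet('data.parquet', engine='pyarrow')) could then be compared against the same call with engine='fastparquet'. Peak RSS is a coarse signal, so the result is only indicative.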

       

       


            People

              Assignee: Wes McKinney (wesm)
              Reporter: Rauli Ruohonen (rauli)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 20m
