ARROW-8801 (Apache Arrow)

[Python] Memory leak on read from parquet file with UTC timestamps using pandas

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.16.0, 0.17.0
    • Fix Version/s: 1.0.0
    • Component/s: Python
    • Environment:
      Tested with pyarrow 0.17.0, pandas 1.0.3, Python 3.7.5 on macOS Mojave. Also tested with pyarrow 0.16.0, pandas 1.0.3, Python 3.8.2 on Ubuntu 20.04 (Linux).

      Description

      Given the following dump.py script:

       

      import pandas as pd
      import numpy as np
      
      
      x = pd.to_datetime(np.random.randint(0, 2**32, size=2**20), unit='ms', utc=True)
      pd.DataFrame({'x': x}).to_parquet('data.parquet', engine='pyarrow', compression=None)
      

      and the following load.py script:

       

      import sys
      import pandas as pd
      
      def foo(engine):
          for _ in range(2**9):
              pd.read_parquet('data.parquet', engine=engine)
          print('Done')
          input()
      
      foo(sys.argv[1])
      

      Running "python dump.py" and then "python load.py pyarrow" leaves the Python process holding more than 4 GB of memory on my machine while it waits for input. With "python load.py fastparquet" the figure is about 100 MB, which points to pyarrow rather than pandas. The leak disappears if "utc=True" is removed from dump.py, in which case the timestamps are timezone-naive.
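The memory figures above were read off while the process waited for input. As a rough way to put a number on the growth from inside the process, the sketch below uses only the standard library. The helper names (`peak_rss_mib`, `peak_rss_growth`, `expected_leak_mib`) are hypothetical and not part of pandas or pyarrow, and the `resource` module is available on Unix only. It also assumes, purely for the estimate, that each leaked read retains one 8-byte value per row.

```python
import gc
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def peak_rss_growth(fn, iterations):
    """Call fn() repeatedly, discarding each result, and return the
    increase in peak RSS (MiB) observed across the loop."""
    before = peak_rss_mib()
    for _ in range(iterations):
        fn()
    gc.collect()  # drop unreachable Python objects before the final reading
    return peak_rss_mib() - before


def expected_leak_mib(rows, reads, bytes_per_value=8):
    """Estimate leaked MiB if every read retained one value per row.

    Assumption for illustration: 8 bytes per timestamp value.
    """
    return rows * bytes_per_value * reads / (1024 * 1024)
```

With the scripts above, one could pass `lambda: pd.read_parquet('data.parquet', engine=engine)` as `fn` with `iterations=2**9` and compare the result per engine. Peak RSS is a high-water mark, so it shows growth but cannot show memory being returned. For scale, `expected_leak_mib(2**20, 2**9)` gives 4096 MiB, which is in line with the 4+ GB reported if one copy of the column were retained on every read; that reading is an inference, not something the report confirms.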

       

       

              People

              • Assignee: Wes McKinney (wesm)
              • Reporter: Rauli Ruohonen (rauli)
              • Votes: 0
              • Watchers: 3

                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0h
                  • Time Spent: 1h 20m