Details
Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.16.0, 0.17.0
Environment: Tested using pyarrow 0.17.0, pandas 1.0.3, Python 3.7.5, macOS Mojave. Also tested using pyarrow 0.16.0, pandas 1.0.3, Python 3.8.2, Ubuntu 20.04 (Linux).
Description
Given this dump.py script:

import pandas as pd
import numpy as np

x = pd.to_datetime(np.random.randint(0, 2**32, size=2**20), unit='ms', utc=True)
pd.DataFrame({'x': x}).to_parquet('data.parquet', engine='pyarrow', compression=None)
and this load.py script:

import sys
import pandas as pd

def foo(engine):
    for _ in range(2**9):
        pd.read_parquet('data.parquet', engine=engine)
    print('Done')
    input()

foo(sys.argv[1])
Running first "python dump.py" and then "python load.py pyarrow", Python's memory usage on my machine stays at 4+ GB while the script waits for input. Running "python load.py fastparquet" instead uses only about 100 MB, so this appears to be a pyarrow issue rather than a pandas issue. The leak disappears if "utc=True" is removed from dump.py, in which case the timestamps are timezone-naive.
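To make the growth visible without an external process monitor, here is a minimal sketch of a variant of load.py that prints the process RSS as it reads. The psutil dependency and the watch_rss.py name are my additions for illustration, not part of the original report:

# watch_rss.py - hypothetical repro helper, not from the original report.
# Prints the resident set size every 64 reads so the leak shows up inline.
import os
import sys

import pandas as pd
import psutil  # assumed available: pip install psutil

proc = psutil.Process(os.getpid())
for i in range(2**9):
    pd.read_parquet('data.parquet', engine=sys.argv[1])
    if i % 64 == 0:
        rss_mib = proc.memory_info().rss / 2**20
        print(f'iteration {i}: rss = {rss_mib:.0f} MiB')
input()  # hold the process so memory can also be checked externally

With the numbers reported above, the pyarrow engine should climb toward the 4+ GB figure while fastparquet stays near 100 MB.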