Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Workaround
Description
I have a dataframe split across more than 5000 Parquet files. I use ParquetDataset(fnames).read() to load all of them. After upgrading pyarrow from 0.13.0 to the latest version, 1.0.1, the read started failing with "OSError: Out of memory: malloc of size 131072 failed". The same code on the same machine still works with the older version. The machine has 256 GB of memory, far more than the < 10 GB the data requires. The code below reproduces the issue.
import pandas as pd
import numpy as np
import pyarrow.parquet as pq

def generate():
    # create a big dataframe
    df = pd.DataFrame({'A': np.arange(50000000)})
    df['F1'] = np.random.randn(50000000) * 100
    df['F2'] = np.random.randn(50000000) * 100
    df['F3'] = np.random.randn(50000000) * 100
    df['F4'] = np.random.randn(50000000) * 100
    df['F5'] = np.random.randn(50000000) * 100
    df['F6'] = np.random.randn(50000000) * 100
    df['F7'] = np.random.randn(50000000) * 100
    df['F8'] = np.random.randn(50000000) * 100
    df['F9'] = 'ABCDEFGH'
    df['F10'] = 'ABCDEFGH'
    df['F11'] = 'ABCDEFGH'
    df['F12'] = 'ABCDEFGH01234'
    df['F13'] = 'ABCDEFGH01234'
    df['F14'] = 'ABCDEFGH01234'
    df['F15'] = 'ABCDEFGH01234567'
    df['F16'] = 'ABCDEFGH01234567'
    df['F17'] = 'ABCDEFGH01234567'
    # split and save data to 5000 files
    for i in range(5000):
        df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)

def read_works():
    # reading the files one by one works
    df = []
    for i in range(5000):
        df.append(pd.read_parquet(f'{i}.parquet'))
    df = pd.concat(df)

def read_errors():
    # crashes with the memory error in pyarrow 1.0/1.0.1 (works fine with 0.13.0)
    # tried use_legacy_dataset=False, same issue
    fnames = [f'{i}.parquet' for i in range(5000)]
    df = pq.ParquetDataset(fnames).read(use_threads=False)
Attachments
Issue Links
- is related to: ARROW-11049 [Python] Expose alternate memory pools (Resolved)
- relates to: ARROW-11228 [C++] Allow fine tuning of memory pool from environment variable(s) (Open)
- relates to: ARROW-11009 [Python] Add environment variable to elect default usage of system memory allocator instead of jemalloc/mimalloc (Resolved)