Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Workaround
Description
I have a dataframe split across more than 5000 Parquet files. I use ParquetDataset(fnames).read() to load all of them. After upgrading pyarrow from 0.13.0 to the latest version, 1.0.1, the read started failing with "OSError: Out of memory: malloc of size 131072 failed". The same code on the same machine still works with the older version. The machine has 256 GB of memory, far more than the < 10 GB the data requires. The code below reproduces the issue.
import pandas as pd
import numpy as np
import pyarrow.parquet as pq

def generate():
    # create a big dataframe
    df = pd.DataFrame({'A': np.arange(50000000)})
    df['F1'] = np.random.randn(50000000) * 100
    df['F2'] = np.random.randn(50000000) * 100
    df['F3'] = np.random.randn(50000000) * 100
    df['F4'] = np.random.randn(50000000) * 100
    df['F5'] = np.random.randn(50000000) * 100
    df['F6'] = np.random.randn(50000000) * 100
    df['F7'] = np.random.randn(50000000) * 100
    df['F8'] = np.random.randn(50000000) * 100
    df['F9'] = 'ABCDEFGH'
    df['F10'] = 'ABCDEFGH'
    df['F11'] = 'ABCDEFGH'
    df['F12'] = 'ABCDEFGH01234'
    df['F13'] = 'ABCDEFGH01234'
    df['F14'] = 'ABCDEFGH01234'
    df['F15'] = 'ABCDEFGH01234567'
    df['F16'] = 'ABCDEFGH01234567'
    df['F17'] = 'ABCDEFGH01234567'
    # split and save data to 5000 files
    for i in range(5000):
        df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)

def read_works():
    # reading the files one by one works
    df = []
    for i in range(5000):
        df.append(pd.read_parquet(f'{i}.parquet'))
    df = pd.concat(df)

def read_errors():
    # crashes with the memory error in pyarrow 1.0/1.0.1 (works fine with 0.13.0)
    # tried use_legacy_dataset=False, same issue
    fnames = [f'{i}.parquet' for i in range(5000)]
    df = pq.ParquetDataset(fnames).read(use_threads=False)
Attachments
Issue Links
- is related to: ARROW-11049 [Python] Expose alternate memory pools (Resolved)
- relates to: ARROW-11228 [C++] Allow fine tuning of memory pool from environment variable(s) (Open)
- relates to: ARROW-11009 [Python] Add environment variable to elect default usage of system memory allocator instead of jemalloc/mimalloc (Resolved)