Apache Arrow / ARROW-9974

[Python][C++] pyarrow version 1.0.1 throws Out Of Memory exception while reading large number of files using ParquetDataset


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Workaround
    • Components: C++, Python

    Description

      https://stackoverflow.com/questions/63792849/pyarrow-version-1-0-bug-throws-out-of-memory-exception-while-reading-large-numbe

      I have a dataframe split and stored across more than 5000 Parquet files, which I load with ParquetDataset(fnames).read(). After upgrading pyarrow from 0.13.0 to the latest version, 1.0.1, this call started throwing "OSError: Out of memory: malloc of size 131072 failed". The same code on the same machine still works with the older version. The machine has 256 GB of memory, far more than enough for the data, which needs less than 10 GB. The code below reproduces the issue.

      import pandas as pd
      import numpy as np
      import pyarrow.parquet as pq
      
      def generate():
          # create a big dataframe
      
          df = pd.DataFrame({'A': np.arange(50000000)})
          df['F1'] = np.random.randn(50000000) * 100
          df['F2'] = np.random.randn(50000000) * 100
          df['F3'] = np.random.randn(50000000) * 100
          df['F4'] = np.random.randn(50000000) * 100
          df['F5'] = np.random.randn(50000000) * 100
          df['F6'] = np.random.randn(50000000) * 100
          df['F7'] = np.random.randn(50000000) * 100
          df['F8'] = np.random.randn(50000000) * 100
          df['F9'] = 'ABCDEFGH'
          df['F10'] = 'ABCDEFGH'
          df['F11'] = 'ABCDEFGH'
          df['F12'] = 'ABCDEFGH01234'
          df['F13'] = 'ABCDEFGH01234'
          df['F14'] = 'ABCDEFGH01234'
          df['F15'] = 'ABCDEFGH01234567'
          df['F16'] = 'ABCDEFGH01234567'
          df['F17'] = 'ABCDEFGH01234567'
      
          # split and save data to 5000 files
          for i in range(5000):
              df.iloc[i*10000:(i+1)*10000].to_parquet(f'{i}.parquet', index=False)
      
      def read_works():
          # below code works to read
          df = []
          for i in range(5000):
              df.append(pd.read_parquet(f'{i}.parquet'))
      
          df = pd.concat(df)
      
      def read_errors():
          # below code crashes with memory error in pyarrow 1.0/1.0.1 (works fine with version 0.13.0)
          # tried use_legacy_dataset=False, same issue
      
          fnames = []
          for i in range(5000):
              fnames.append(f'{i}.parquet')
      
          df = pq.ParquetDataset(fnames).read(use_threads=False)

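      The issue was closed with resolution "Workaround", but the exact workaround is not spelled out in this description (the attached legacy_false.txt and legacy_true.txt presumably correspond to runs with use_legacy_dataset=False and True). As a minimal sketch of one way to sidestep the failing ParquetDataset(fnames).read() call, the approach of read_works() above can be kept entirely in Arrow; read_workaround below is a hypothetical helper, not code from the report:

      import pyarrow as pa
      import pyarrow.parquet as pq

      def read_workaround(fnames):
          # Hypothetical sketch: read each Parquet file into its own Arrow table
          # and concatenate the pieces with pa.concat_tables, instead of handing
          # all 5000 paths to ParquetDataset in a single call.
          tables = [pq.read_table(f) for f in fnames]
          return pa.concat_tables(tables)

      # Usage against the files written by generate() above:
      # table = read_workaround([f'{i}.parquet' for i in range(5000)])
      # df = table.to_pandas()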
      Attachments

        1. legacy_false.txt (0.1 kB, uploaded by Ashish Gupta)
        2. legacy_true.txt (20 kB, uploaded by Ashish Gupta)



            People

              Assignee: Unassigned
              Reporter: Ashish Gupta (kgashish)
              Votes: 0
              Watchers: 8
